Article: Simplifying Alignment: From RLHF to Direct Preference Optimization (DPO) • By ariG23498 • Jan 19 • 32
Reply: Hi there, I think there is an error in your PPO description. PPO itself does not explicitly penalize the KL divergence from the initial (reference) policy.
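To make the distinction concrete, here is a sketch of the two objectives as usually written (following the standard PPO paper and the common RLHF formulation; the symbols $\epsilon$, $\beta$, $r_\phi$, and $\pi_{\text{ref}}$ are the usual ones, not taken from the article itself). PPO's clipped surrogate only constrains the update relative to the *old* policy via the clipped ratio:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
$$

The explicit KL penalty against a frozen reference policy is something RLHF pipelines typically add on top, by shaping the reward:

$$
r(x, y) = r_\phi(x, y) - \beta\,\mathrm{KL}\!\left[\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\text{ref}}(\cdot \mid x)\right],
$$

so the KL-to-reference term comes from the RLHF setup rather than from the PPO algorithm itself.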
Article: DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge • By NormalUhr • Feb 7 • 243