arxiv:2305.18290

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Published on May 29, 2023

Abstract

AI-generated summary: Direct Preference Optimization (DPO) fine-tunes language models directly on human preference data, without a separate reward model or reinforcement learning loop, and is reported to match or exceed reinforcement learning from human feedback (RLHF) while being simpler and cheaper to train.

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
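The "classification problem on the human preference data" mentioned in the abstract can be illustrated with a minimal sketch of the per-pair DPO loss. The function and argument names below are hypothetical, the inputs are assumed to be summed token log-probabilities of each full response under the trained policy and under a frozen reference model, and `beta=0.1` is only an illustrative default for the strength of the penalty on drifting from the reference model. This is a plain-Python illustration, not the authors' implementation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed illustrative value, not taken from the page
) -> float:
    """DPO loss for one (chosen, rejected) preference pair.

    Computes -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)),
    where each log-ratio is log pi(y|x) - log pi_ref(y|x).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# A policy identical to the reference has zero margin, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen response's likelihood relative to the reference,
# and lowering the rejected one's, reduces the loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Because the loss depends only on log-probabilities of responses already in the preference dataset, no reward model is fitted and nothing is sampled from the LM during fine-tuning, which matches the simplification the abstract describes.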

Community

Unlocking Language Models: Direct Preference Optimization

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix
The DPO equation and its LaTeX source are available at http://www.deepnlp.org/equation/direct-preference-optimization-dpo, which can save you the time of typing out very long equations.

For more AI-related equations and LaTeX code, you can check out the equation search engine at http://www.deepnlp.org/search/equation

This blog post deserves a shoutout:
https://www.tylerromero.com/posts/2024-04-dpo/

Models citing this paper: 1,000+
Datasets citing this paper: 15
Spaces citing this paper: 20,844
Collections including this paper: 50