Direct Preference Optimization: Your Language Model is Secretly a Reward Model
This December 2023 paper introduced Direct Preference Optimization (DPO), an approach for fine-tuning large unsupervised language models (LMs) to align with human preferences without the complexities and instabilities associated with traditional reinforcement learning from human feedback (RLHF).
Language models are traditionally trained on massive, unsupervised datasets derived from diverse human-generated content.
Steering these models to produce desired outputs typically involves RLHF.
This method first constructs a reward model from human feedback on the quality of model outputs. The language model is then fine-tuned with reinforcement learning to maximise this learned reward, while a penalty on divergence from the original model keeps it from drifting too far from its pre-trained behaviour.
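Concretely, the RL fine-tuning stage is usually posed as KL-constrained reward maximisation. In the notation of the DPO paper, with learned reward r_phi, the original (reference) model pi_ref, the policy being trained pi_theta, and a penalty weight beta:

```latex
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
```

Because the expectation is taken over samples drawn from the policy being trained, optimising this objective requires repeated generation from the model, which is the main source of RLHF's cost and instability.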
RLHF is computationally demanding and complex.
It must balance maximising the learned reward against not drifting from the model's original capabilities. This balancing act can destabilise training and demands substantial compute, largely because it involves repeatedly sampling from the language model during training.
DPO simplifies the preference learning process significantly.
Unlike RLHF, which requires separate stages for learning a reward model and then fine-tuning the language model based on this reward, DPO integrates preference alignment directly into the language model training process using a simple classification loss.
Preference Modelling: DPO leverages a model of human preferences (the Bradley-Terry model) to shape the language model's training directly. Instead of fitting a separate reward model, it adjusts the language model so that preferred responses become more likely than dispreferred ones.
Optimisation Method: DPO uses a simple binary cross-entropy loss in place of the RL machinery used in traditional RLHF. The loss directly raises the likelihood of responses that align with human preferences relative to those that do not, steering the model in the desired direction without separate reward estimation (a minimal sketch of this loss appears after this list).
Stability and Efficiency: By eliminating the need for sampling during training and reducing reliance on complex hyperparameter tuning, DPO is both computationally efficient and stable compared to RLHF.
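To make the classification loss concrete, here is a minimal PyTorch-style sketch of the DPO objective: the negative log-sigmoid of the difference in scaled log-probability ratios between the preferred and dispreferred response. The function and variable names are illustrative rather than taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Binary cross-entropy form of the DPO loss.

    Each argument is a tensor of summed log-probabilities log pi(y | x)
    for the chosen (preferred) or rejected response, under either the
    policy being trained or the frozen reference model. `beta` controls
    how strongly the policy is kept close to the reference model.
    """
    # Implicit rewards: beta * log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO loss: -log sigmoid(reward margin), i.e. binary cross-entropy
    # on the preference label under a Bradley-Terry model.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()

# Example usage with made-up log-probabilities for a batch of two pairs.
policy_chosen = torch.tensor([-12.3, -20.1])
policy_rejected = torch.tensor([-14.8, -19.5])
ref_chosen = torch.tensor([-13.0, -20.0])
ref_rejected = torch.tensor([-13.5, -19.8])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Both the chosen and rejected responses come from a fixed preference dataset, so no sampling from the model is needed while this loss is optimised, which is exactly what gives DPO its stability and efficiency advantage.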
The paper's experiments demonstrate that DPO aligns language models with human preferences effectively, often matching or exceeding PPO-based RLHF on tasks such as sentiment control, summarisation, and dialogue generation.
Notably, DPO achieves these results with simpler implementation and lower computational overhead.
The introduction of DPO represents a significant simplification in the process of aligning LMs with human preferences.
By optimising directly for the policy that best satisfies these preferences, DPO eliminates the need for separate reward model training and complex RL procedures, offering a more efficient and direct path to enhancing LM performance across various applications.
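The key step behind this is the paper's change of variables: the optimal policy of the KL-constrained objective above satisfies a closed-form relationship with the reward, so the reward can be rewritten in terms of the policy itself:

```latex
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```

When two responses to the same prompt are compared under the Bradley-Terry model, the intractable partition function Z(x) cancels, leaving the binary cross-entropy loss sketched above, expressed purely in terms of the policy and the reference model.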
This analysis highlights how techniques from reinforcement learning, supervised learning, and human-centred design converge to improve the performance and applicability of LMs.
The paper's related work provides a solid foundation for understanding the challenges involved in training these models, as well as the approaches being developed to address them.