# Direct Preference Optimization (DPO)
We support Direct Preference Optimization (DPO) training on a variety of datasets.
## Implemented Variants

- `dpo_tune_cache.py` is the DPO implementation that directly optimizes model outputs based on human preferences.

## `dpo_tune_cache.py`
This implementation has the following key features:
- Automatically saves the trained checkpoint to the HuggingFace Hub
- Supports LigerKernel for optimized training with fused operations
- Implements the DPO algorithm for direct preference optimization
There are several relevant implementation details:
- To save memory, we 1) cache the reference model's logprobs over the dataset, and 2) remove the reference model from memory once those logprobs are computed. This means you won't see any training losses until the reference logprobs have been computed.
- We use the `dpo_norm` loss type by default, which is a length-normalized DPO loss; see the SimPO paper for more details. A sketch of both details appears after this list.
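To make these two details concrete, here is a minimal sketch of the idea, not the exact code in `dpo_tune_cache.py`: reference logprobs are computed once and cached so the reference model can be freed, and the length-normalized loss works on per-token-averaged logprobs. The function names, batch keys, and `beta` handling below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def average_logps(model, input_ids, labels):
    """Per-token-averaged log-probabilities of the labels (length normalization)."""
    logits = model(input_ids).logits[:, :-1]                  # predicts tokens 1..T
    labels, mask = labels[:, 1:], labels[:, 1:] != -100       # shift and mask padding
    token_logps = torch.gather(
        logits.log_softmax(-1), 2, labels.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)
    return (token_logps * mask).sum(-1) / mask.sum(-1)


@torch.no_grad()
def cache_reference_logprobs(ref_model, dataloader):
    """Compute the reference model's logprobs once; the reference model can then
    be deleted to free memory, which is why no losses are reported during this pass."""
    cache = []
    for batch in dataloader:  # batch keys are illustrative
        cache.append((
            average_logps(ref_model, batch["chosen_input_ids"], batch["chosen_labels"]),
            average_logps(ref_model, batch["rejected_input_ids"], batch["rejected_labels"]),
        ))
    return cache


def dpo_norm_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta):
    """Length-normalized DPO loss: the logprobs are already averaged per token."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()
```

Once the cache is built, the reference model is dropped from memory, which is why the first training losses only appear after this pass over the dataset completes.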
## Debug (Single GPU)

You can run the script in single-GPU mode to debug the training process:

```bash
bash scripts/train/debug/dpo.sh
```
## Reproduce `allenai/Llama-3.1-Tulu-3-8B-DPO` (4 Nodes)

You can reproduce our `allenai/Llama-3.1-Tulu-3-8B-DPO` model by running the following command:

```bash
bash scripts/train/tulu3/dpo_8b.sh
```
👉 Tracked WandB Experiments (Click to expand)
Info
Based on our internal evaluation, the DPO model is roughly on par with the original `allenai/Llama-3.1-Tulu-3-8B-DPO` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used during training.
For example, DROP is lower than the reference, but DROP can be quite brittle due to parsing issues (see below).
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
## Reproduce `allenai/OLMo-2-1124-7B-DPO` (4 Nodes)

You can reproduce our `allenai/OLMo-2-1124-7B-DPO` model by running the following command:

```bash
bash scripts/train/olmo2/dpo_7b.sh
```
Info
If you are an external user, `mason.py` will print out the actual command being executed on our internal server, so you can modify the command as needed.
👉 Tracked WandB Experiments (Click to expand)
Info
Based on our internal evaluation, the DPO model is roughly on par with the original `allenai/OLMo-2-1124-7B-DPO` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used during training.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
## Reproduce `allenai/OLMo-2-1124-13B-DPO` (4 Nodes)

You can reproduce our `allenai/OLMo-2-1124-13B-DPO` model by running the following command:

```bash
bash scripts/train/olmo2/dpo_13b.sh
```
👉 Tracked WandB Experiments (Click to expand)
Info
Based on our internal evaluation, the DPO model is roughly on par with the original `allenai/OLMo-2-1124-13B-DPO` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used during training.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
## Training Metrics
During training, the following metrics are logged:
- `training_step`: Current training step
- `learning_rate`: The current learning rate from the learning rate scheduler
- `epoch`: Current epoch (as a fraction of the total dataset)
- `train_loss`: The average training loss over the logged steps
- `logps/chosen`: Average log probabilities for chosen responses
- `logps/rejected`: Average log probabilities for rejected responses
For the DPO and DPO-norm loss types, additional reward metrics are logged (a sketch of how these rewards are derived follows this list):
- `rewards/chosen`: Average rewards for chosen responses
- `rewards/rejected`: Average rewards for rejected responses
- `rewards/average`: Average of the chosen and rejected rewards
- `rewards/accuracy`: Accuracy of preference prediction
- `rewards/margin`: Margin between the chosen and rejected rewards
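These values follow the standard DPO notion of an implicit reward: beta times the policy-vs-reference log-ratio. Below is a minimal sketch of how such metrics can be derived from the per-response logprobs; the function name and free-standing structure are illustrative assumptions, not the exact code in `dpo_tune_cache.py`.

```python
import torch


def dpo_reward_metrics(policy_chosen_logps, policy_rejected_logps,
                       ref_chosen_logps, ref_rejected_logps, beta):
    """Implicit DPO rewards: beta-scaled log-ratio of policy vs. reference logprobs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return {
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        "rewards/average": ((chosen_rewards + rejected_rewards) / 2).mean().item(),
        "rewards/accuracy": (chosen_rewards > rejected_rewards).float().mean().item(),
        "rewards/margin": (chosen_rewards - rejected_rewards).mean().item(),
    }
```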
When using the load balancing loss (for OLMoE), the following metric is also logged:

- `aux_loss`: Auxiliary loss for load balancing
The metrics are logged every `logging_steps` steps (if specified) and provide insights into:
- Training progress (loss, learning rate, epoch)
- Model behavior (log probabilities, rewards)
- Preference learning (accuracy, margin)
- Resource utilization (auxiliary losses)
## Acknowledgements
We would like to thank the following projects for general infrastructure: