Direct Preference Optimization (DPO)

We support Direct Preference Optimization (DPO) training on a variety of datasets.

Implemented Variants

dpo_tune_cache.py is the DPO implementation that directly optimizes model outputs based on human preferences.

`dpo_tune_cache.py`

This implementation has the following key features:

Automatically saves the trained checkpoint to the Hugging Face Hub
Supports LigerKernel for optimized training with fused operations
Implements the DPO algorithm for direct preference optimization

There are several relevant implementation details:

To save memory, we 1) cache the reference model's logprobs over the dataset and 2) remove the reference model from memory once the logprobs are computed. As a result, no training losses are reported until this caching pass has finished.
We use the dpo_norm loss type by default, which is a length-normalized loss. See the SimPO paper for more details.
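The two details above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in dpo_tune_cache.py: the helper names (`sequence_logprob`, `cache_reference_logprobs`, `dpo_loss`), the toy reference model, and the `beta` value are all hypothetical, and we assume that length normalization means averaging per-token logprobs over the response instead of summing them.

```python
import math

def sequence_logprob(token_logprobs, average=True):
    """Reduce per-token logprobs to one number: mean (length-normalized) or sum."""
    total = sum(token_logprobs)
    return total / len(token_logprobs) if average else total

def cache_reference_logprobs(reference_model, dataset, average=True):
    """Run the reference model once over the dataset and keep only its logprobs."""
    cache = []
    for ex in dataset:
        cache.append({
            "chosen": sequence_logprob(reference_model(ex["prompt"], ex["chosen"]), average),
            "rejected": sequence_logprob(reference_model(ex["prompt"], ex["rejected"]), average),
        })
    return cache

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Hypothetical stand-in for a reference model: every token gets logprob -0.5.
def toy_reference_model(prompt, response):
    return [-0.5] * len(response.split())

dataset = [{"prompt": "Q: 2+2?", "chosen": "The answer is 4", "rejected": "5"}]

# Step 1: cache the reference logprobs. Step 2: drop the reference model.
ref_cache = cache_reference_logprobs(toy_reference_model, dataset)
del toy_reference_model

# Training then only needs the policy's logprobs plus the cached values.
policy_chosen = sequence_logprob([-0.2, -0.2, -0.2, -0.2])
policy_rejected = sequence_logprob([-0.9])
loss = dpo_loss(policy_chosen, policy_rejected,
                ref_cache[0]["chosen"], ref_cache[0]["rejected"])
print(f"loss = {loss:.4f}")
```

Because the reference values are fixed numbers after the first pass, the training loop never needs a second model in memory. The cost is the up-front delay before the first loss is logged.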

Debug (Single GPU)

You can run the script in single-GPU mode to debug the training process.

bash scripts/train/debug/dpo.sh

Reproduce `allenai/Llama-3.1-Tulu-3-8B-DPO` (4 Nodes)

You can reproduce our allenai/Llama-3.1-Tulu-3-8B-DPO model by running the following command:

bash scripts/train/tulu3/dpo_8b.sh

Info

Based on our internal evaluation, the reproduced DPO model is roughly on par with the original allenai/Llama-3.1-Tulu-3-8B-DPO model, with some slight differences. Your results may also vary slightly depending on the random seed used in training.

For example, DROP is lower than the reference, but DROP can be quite brittle due to parsing issues (see below).

[Figure: dpo_plot]

Info

We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!

Reproduce `allenai/OLMo-2-1124-7B-DPO` (4 Nodes)

You can reproduce our allenai/OLMo-2-1124-7B-DPO model by running the following command:

bash scripts/train/olmo2/dpo_7b.sh

Info

If you are an external user, mason.py will print out the actual command being executed on our internal server, so you can modify the command as needed.

Info

Based on our internal evaluation, the reproduced DPO model is roughly on par with the original allenai/OLMo-2-1124-7B-DPO model, with some slight differences. Your results may also vary slightly depending on the random seed used in training.

Info

We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!

Reproduce `allenai/OLMo-2-1124-13B-DPO` (4 Nodes)

You can reproduce our allenai/OLMo-2-1124-13B-DPO model by running the following command:

bash scripts/train/olmo2/dpo_13b.sh

Info

Based on our internal evaluation, the reproduced DPO model is roughly on par with the original allenai/OLMo-2-1124-13B-DPO model, with some slight differences. Your results may also vary slightly depending on the random seed used in training.

Info

We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!

Training Metrics

During training, the following metrics are logged:

training_step: Current training step
learning_rate: The current learning rate from the learning rate scheduler
epoch: Current epoch (as a fraction of total dataset)
train_loss: The average training loss over the logged steps
logps/chosen: Average log probabilities for chosen responses
logps/rejected: Average log probabilities for rejected responses

For DPO and DPO-norm loss types, additional metrics are logged:

rewards/chosen: Average rewards for chosen responses
rewards/rejected: Average rewards for rejected responses
rewards/average: Average of chosen and rejected rewards
rewards/accuracy: Accuracy of preference prediction
rewards/margin: Margin between chosen and rejected rewards
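As a rough illustration of how these reward metrics relate to each other, the sketch below computes them from per-example logprobs. It assumes the implicit reward from the DPO paper, beta * (policy logprob - reference logprob). The function name `reward_metrics` and the input values are hypothetical, and the exact reductions in dpo_tune_cache.py may differ.

```python
from statistics import mean

def reward_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute batch-level reward metrics from per-example sequence logprobs."""
    # Implicit DPO reward: beta * (policy logprob - reference logprob)
    chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    return {
        "rewards/chosen": mean(chosen),
        "rewards/rejected": mean(rejected),
        "rewards/average": (mean(chosen) + mean(rejected)) / 2,
        # Fraction of pairs where the chosen response gets the higher reward
        "rewards/accuracy": mean([1.0 if c > r else 0.0 for c, r in zip(chosen, rejected)]),
        # Mean gap between chosen and rejected rewards
        "rewards/margin": mean([c - r for c, r in zip(chosen, rejected)]),
    }

# Two made-up preference pairs; beta=1.0 keeps the arithmetic easy to follow.
metrics = reward_metrics(
    policy_chosen=[-1.0, -2.0],
    policy_rejected=[-3.0, -1.0],
    ref_chosen=[-2.0, -2.0],
    ref_rejected=[-2.0, -2.0],
    beta=1.0,
)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In this toy batch the policy prefers the chosen response in the first pair only, so the accuracy is 0.5 while the margin stays positive. A rising margin with flat accuracy usually means the model is becoming more confident on pairs it already ranks correctly.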

When using load balancing loss (for OLMoE), the following metric is also logged:

aux_loss: Auxiliary loss for load balancing

The metrics are logged every logging_steps steps (if specified) and provide insights into:

Training progress (loss, learning rate, epoch)
Model behavior (log probabilities, rewards)
Preference learning (accuracy, margin)
Expert load balancing (auxiliary loss)

Acknowledgements

We would like to thank the following projects for general infrastructure:
