# Supervised finetuning (SFT)
We support supervised finetuning (SFT) on a variety of datasets.
## Implemented Variants
- `finetune.py` is the original SFT implementation.
## `finetune.py`
This implementation has the following key features:
- Automatically saves the trained checkpoint to the HuggingFace Hub
- Supports LigerKernel for optimized training with fused operations
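To make these features concrete, here is a minimal sketch of what they typically involve at the library level. This is not the code in `finetune.py` (whose flags and internals may differ); the model and repo names are placeholders.

```python
# Minimal sketch (not the actual finetune.py code) of the two features above:
# Liger Kernel's fused ops are patched in before the model is created, and the
# finished checkpoint is uploaded to the HuggingFace Hub.
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# Monkey-patch the transformers Llama implementation with fused Liger kernels
# (RMSNorm, SwiGLU, fused linear cross-entropy, ...) before loading the model.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model

# ... supervised finetuning loop goes here ...

# Push the trained checkpoint to the Hub (repo name is a placeholder).
model.push_to_hub("your-org/your-sft-checkpoint", private=True)
```

See `finetune.py` itself for the exact command-line options that control these behaviors.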
### Debug (Single GPU)
You can run the script in single-GPU mode to debug the training process.
`bash scripts/train/debug/finetune.sh`
### Reproduce allenai/Llama-3.1-Tulu-3-8B-SFT (8 Nodes)
You can reproduce our `allenai/Llama-3.1-Tulu-3-8B-SFT` model by running the following command:
`bash scripts/train/tulu3/finetune_8b.sh`
Info
If you are an external user, `mason.py` will print out the actual command being executed on our internal server, so you can modify the command as needed.
👉 Tracked WandB Experiments
Info
Based on our internal evaluation, the resulting SFT model is roughly on par with the original `allenai/Llama-3.1-Tulu-3-8B-SFT` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used during training.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
### Reproduce allenai/OLMo-2-1124-7B-SFT (8 Nodes)
You can reproduce our `allenai/OLMo-2-1124-7B-SFT` model by running the following command:
`bash scripts/train/olmo2/finetune_7b.sh`
👉 Tracked WandB Experiments
Info
Based on our internal evaluation, the resulting SFT model is roughly on par with the original `allenai/OLMo-2-1124-7B-SFT` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used during training.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
### Reproduce allenai/OLMo-2-1124-13B-SFT (8 Nodes)
You can reproduce our `allenai/OLMo-2-1124-13B-SFT` model by running the following command:
`bash scripts/train/olmo2/finetune_13b.sh`
👉 Tracked WandB Experiments
Info
Based on our internal evaluation, the resulting SFT model is roughly on par with the original `allenai/OLMo-2-1124-13B-SFT` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used during training.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
### Reproduce allenai/OLMo-2-1124-32B-SFT (8 Nodes)
You can reproduce our `allenai/OLMo-2-1124-32B-SFT` model by running the following command:
`bash scripts/train/olmo2/finetune_32b.sh`
👉 Tracked WandB Experiments
Info
Based on our internal evaluation, the resulting SFT model is roughly on par with the original `allenai/OLMo-2-1124-32B-SFT` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used during training.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
## Training Metrics
During training, the following metrics are logged:
- `learning_rate`: The current learning rate from the learning rate scheduler
- `train_loss`: The average training loss over the logged steps
- `total_tokens`: Total number of tokens processed (excluding padding)
- `per_device_tps`: Tokens per second processed per device (excluding padding)
- `total_tokens_including_padding`: Total number of tokens, including padding tokens
- `per_device_tps_including_padding`: Tokens per second processed per device (including padding)
The metrics are logged every `logging_steps` steps (if specified) and provide insights into:
- Training progress (loss, learning rate)
- Training efficiency (tokens per second)
- Resource utilization (padding vs non-padding tokens)
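For intuition, here is a small, self-contained sketch of how the padding-aware token counts and throughput numbers can be derived from a batch. The dictionary keys mirror the logged metric names, but the code is illustrative (single-process view) rather than the actual implementation in `finetune.py`; in multi-GPU training the token totals would also be aggregated across devices.

```python
# Illustrative sketch (assumed variable names, not the exact finetune.py code):
# computing token counts and tokens-per-second with and without padding.
import time

import torch


def count_tokens(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> tuple[int, int]:
    """Return (tokens excluding padding, tokens including padding) for one batch."""
    return int(attention_mask.sum().item()), input_ids.numel()


# Toy batch: two sequences of length 8; the second is padded after 5 real tokens.
input_ids = torch.randint(0, 100, (2, 8))
attention_mask = torch.tensor([[1] * 8, [1] * 5 + [0] * 3])

start = time.perf_counter()
total_tokens = 0
total_tokens_including_padding = 0

for _ in range(10):  # stands in for the training steps between logging events
    # ... forward/backward/optimizer step would happen here ...
    non_pad, with_pad = count_tokens(input_ids, attention_mask)
    total_tokens += non_pad
    total_tokens_including_padding += with_pad

elapsed = time.perf_counter() - start
print({
    "total_tokens": total_tokens,  # excludes padding
    "per_device_tps": total_tokens / elapsed,  # tokens/sec on this device
    "total_tokens_including_padding": total_tokens_including_padding,
    "per_device_tps_including_padding": total_tokens_including_padding / elapsed,
})
```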
## Acknowledgements
We would like to thank the following projects for general infrastructure: