# Supervised finetuning (SFT)

We support supervised finetuning (SFT) on a variety of datasets.

## Implemented Variants

- OLMo-core SFT (recommended): uses OLMo-core's native training infrastructure. Most users should use this; it is more GPU-efficient and supports OLMo, Qwen, and other models. See `open_instruct/olmo_core_utils.py` for the current list of supported models.
- `finetune.py` is the legacy SFT implementation using DeepSpeed/Accelerate. Use this only if you need a model architecture not yet supported by OLMo-core.
## OLMo-core SFT

The recommended SFT implementation uses OLMo-core's SFT training script.

### Setup

OLMo-core SFT requires a separate OLMo-core clone; the `build_image_and_launch.sh` script only works for open-instruct jobs (DPO, RL), not for SFT.

1. Clone OLMo-core:

    ```bash
    git clone https://github.com/allenai/OLMo-core.git
    ```

2. Follow the OLMo-core setup instructions to install dependencies.

3. Run the SFT training script from the OLMo-core checkout. For example:

    ```bash
    cd /path/to/OLMo-core
    python src/scripts/train/sft/OLMo-sft.py train \
        my-experiment-name \
        gs://my-bucket/checkpoint/path \
        ai2/jupiter \
        --trainer.max_duration.value=2 \
        --train_module.optim.lr=8e-5 \
        --seq_len=32768 \
        --launch.num_gpus=8 \
        --num_nodes=4 \
        --global_batch_size=1048576 \
        --model_name=olmo3-7b
    ```
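Since `--global_batch_size` is given in tokens, it is worth sanity-checking how the example flags combine: the sequence length and world size determine how many sequences each GPU contributes per optimizer step. A quick back-of-the-envelope check (plain Python arithmetic, not part of the training script):

```python
# Back-of-the-envelope check using the values from the example command above.
global_batch_size_tokens = 1_048_576  # --global_batch_size (in tokens)
seq_len = 32_768                      # --seq_len
gpus_per_node = 8                     # --launch.num_gpus
num_nodes = 4                         # --num_nodes

world_size = gpus_per_node * num_nodes               # total GPUs
seqs_per_step = global_batch_size_tokens // seq_len  # sequences per optimizer step
seqs_per_gpu = seqs_per_step // world_size           # sequences each GPU contributes

print(world_size, seqs_per_step, seqs_per_gpu)  # 32 32 1
```

So the example run packs exactly one 32k-token sequence per GPU per step; doubling `--global_batch_size` would double the per-GPU load (or require gradient accumulation, depending on the trainer's configuration).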
### Key Flags

OLMo-core SFT uses CLI flags (not YAML config files). Key options include:

| Flag | Description |
|---|---|
| `--trainer.max_duration.value` | Number of training epochs |
| `--train_module.optim.lr` | Learning rate |
| `--seq_len` | Maximum sequence length |
| `--launch.num_gpus` | GPUs per node |
| `--num_nodes` | Number of nodes |
| `--global_batch_size` | Global batch size in tokens |
| `--model_name` | Model architecture (e.g., `olmo3-7b`, `olmo3-32b`) |
| `--dataset_path` | Path to the SFT dataset |
| `--trainer.callbacks.wandb.enabled` | Enable W&B logging |
See the OLMo-core documentation for the full list of options.
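As an illustration of how these flags compose, a run pointing at a custom dataset with W&B logging enabled might look like the following sketch. The bucket, dataset path, and experiment name are placeholders; the flags themselves come from the table above, and the `=true` value for the W&B toggle is an assumption about the CLI's boolean syntax.

```shell
# Hypothetical invocation: paths and experiment name are placeholders.
python src/scripts/train/sft/OLMo-sft.py train \
    my-experiment-name \
    gs://my-bucket/checkpoint/path \
    ai2/jupiter \
    --model_name=olmo3-7b \
    --dataset_path=/path/to/my_sft_dataset \
    --trainer.callbacks.wandb.enabled=true \
    --train_module.optim.lr=8e-5 \
    --seq_len=32768
```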
### Olmo 3 SFT Scripts

These scripts are run from the OLMo-core checkout, not via `build_image_and_launch.sh`.

| Script | Scale | Description |
|---|---|---|
| `scripts/train/olmo3/7b_instruct_sft.sh` | 4 nodes (32 GPUs) | Olmo 3 7B Instruct SFT, 32k sequence length |
| `scripts/train/olmo3/7b_think_sft.sh` | 4 nodes (32 GPUs) | Olmo 3 7B Think SFT, 32k sequence length |
| `scripts/train/olmo3/32b_instruct_sft.sh` | 8 nodes (64 GPUs) | Olmo 3 32B Instruct SFT, 32k sequence length |
| `scripts/train/olmo3/32b_think_sft.sh` | 32 nodes (256 GPUs) | Olmo 3 32B Think SFT, 32k sequence length |
## finetune.py (Legacy)

This implementation has the following key features:

- Auto-saves the trained checkpoint to the HuggingFace Hub
- Supports LigerKernel for optimized training with fused operations
### Debug Scripts

| Script | Scale | Launch |
|---|---|---|
| `scripts/train/debug/finetune.sh` | 1 GPU, local | `bash scripts/train/debug/finetune.sh` |
| `scripts/train/debug/sft_integration_test.sh` | 1 GPU, Beaker | `./scripts/train/build_image_and_launch.sh scripts/train/debug/sft_integration_test.sh` |
| `scripts/train/debug/sft_multinode_test.sh` | 2 nodes, Beaker | `./scripts/train/build_image_and_launch.sh scripts/train/debug/sft_multinode_test.sh` |

### Key Flags

| Group | Flag | Description | Default |
|---|---|---|---|
| Model | `--model_name_or_path` | Model checkpoint for weight initialization | — |
| Model | `--use_liger_kernel` | Use LigerKernel for optimized training | `False` |
| Training | `--learning_rate` | Initial learning rate | `2e-5` |
| Training | `--num_train_epochs` | Total number of training epochs | `2` |
| Training | `--per_device_train_batch_size` | Batch size per GPU | `8` |
| Training | `--gradient_accumulation_steps` | Gradient accumulation steps | `1` |
| Training | `--max_seq_length` | Maximum sequence length after tokenization | — |
| Training | `--warmup_ratio` | Linear warmup fraction of total steps | `0.03` |
| Training | `--lr_scheduler_type` | LR scheduler: `linear`, `cosine`, etc. | `linear` |
| Training | `--gradient_checkpointing` | Use gradient checkpointing (saves memory) | `False` |
| Training | `--seed` | Random seed | `42` |
| Data | `--dataset_mixer_list` | List of datasets (local or HF) to sample from | — |
| Data | `--dataset_mixer_list_splits` | Dataset splits for training | `["train"]` |
| Data | `--chat_template_name` | Chat template to use | `None` |
| Data | `--packing` | Use packing/padding-free collation | `False` |
| Parallelism | `--sequence_parallel_size` | Ulysses sequence parallelism degree | `1` |
| LoRA | `--use_lora` | Use LoRA for parameter-efficient training | `False` |
| LoRA | `--lora_rank` | Rank of LoRA | `64` |
| LoRA | `--lora_alpha` | Alpha parameter of LoRA | `16` |
| Checkpointing | `--output_dir` | Output directory for checkpoints | `output/` |
| Checkpointing | `--checkpointing_steps` | Save every N steps or `epoch` | — |
| Checkpointing | `--resume_from_checkpoint` | Resume from checkpoint folder | `None` |
| Logging | `--with_tracking` | Track experiment with Weights & Biases | `False` |
| Logging | `--logging_steps` | Log training loss every N steps | — |
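Two of the knobs above interact directly: the effective global batch size is the product of the per-device batch size, the gradient accumulation steps, and the world size, while `--warmup_ratio` carves a linear ramp out of the start of the schedule. The sketch below mirrors (rather than calls) that logic, assuming the default `linear` scheduler that warms up to the base learning rate and then decays linearly to zero:

```python
def effective_batch_size(per_device_bs: int, grad_accum: int, world_size: int) -> int:
    """Sequences consumed per optimizer step across all GPUs."""
    return per_device_bs * grad_accum * world_size


def linear_schedule_lr(step: int, total_steps: int, base_lr: float,
                       warmup_ratio: float = 0.03) -> float:
    """Sketch of a linear-warmup, linear-decay schedule (assumed, not the trainer's code)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(warmup_steps, 1)
    # After warmup: decay linearly to 0 over the remaining steps.
    return base_lr * (total_steps - step) / max(total_steps - warmup_steps, 1)


# Defaults from the table: per_device=8, grad_accum=1, on a hypothetical 8-GPU node.
print(effective_batch_size(8, 1, 8))       # 64 sequences per optimizer step
print(linear_schedule_lr(0, 1000, 2e-5))   # 0.0 (start of warmup)
print(linear_schedule_lr(30, 1000, 2e-5))  # 2e-05 (warmup ends at 1000 * 0.03 = 30 steps)
```

Increasing `--gradient_accumulation_steps` is the usual way to grow the effective batch size when `--per_device_train_batch_size` is memory-bound.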
### Reproduce allenai/Llama-3.1-Tulu-3-8B-SFT (8 Nodes)

You can reproduce our `allenai/Llama-3.1-Tulu-3-8B-SFT` model by running the following command:

```bash
bash scripts/train/tulu3/finetune_8b.sh
```

!!! info
    If you are an external user, `mason.py` will print out the actual command being executed on our internal server, so you can modify the command as needed.
!!! info
    Based on our internal evaluation, the SFT model is roughly on par with the original `allenai/Llama-3.1-Tulu-3-8B` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.

!!! info
    We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
### Reproduce allenai/OLMo-2-1124-7B-SFT (8 Nodes)

You can reproduce our `allenai/OLMo-2-1124-7B-SFT` model by running the following command:

```bash
bash scripts/train/olmo2/finetune_7b.sh
```
!!! info
    Based on our internal evaluation, the SFT model is roughly on par with the original `allenai/OLMo-2-1124-7B` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.

!!! info
    We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
### Reproduce allenai/OLMo-2-1124-13B-SFT (8 Nodes)

You can reproduce our `allenai/OLMo-2-1124-13B-SFT` model by running the following command:

```bash
bash scripts/train/olmo2/finetune_13b.sh
```
!!! info
    Based on our internal evaluation, the SFT model is roughly on par with the original `allenai/OLMo-2-1124-13B` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.

!!! info
    We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
### Reproduce allenai/OLMo-2-1124-32B-SFT (8 Nodes)

You can reproduce our `allenai/OLMo-2-1124-32B-SFT` model by running the following command:

```bash
bash scripts/train/olmo2/finetune_32b.sh
```
!!! info
    Based on our internal evaluation, the SFT model is roughly on par with the original `allenai/OLMo-2-1124-32B` model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.

!!! info
    We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
## Training Metrics

During training, the following metrics are logged:

- `learning_rate`: the current learning rate from the learning rate scheduler
- `train_loss`: the average training loss over the logged steps
- `total_tokens`: total number of tokens processed (excluding padding)
- `per_device_tps`: tokens per second processed per device (excluding padding)
- `total_tokens_including_padding`: total number of tokens, including padding tokens
- `per_device_tps_including_padding`: tokens per second processed per device (including padding)

The metrics are logged every `logging_steps` steps (if specified) and provide insight into:

- Training progress (loss, learning rate)
- Training efficiency (tokens per second)
- Resource utilization (padding vs. non-padding tokens)
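The throughput metrics can be reproduced from raw counts: `per_device_tps` is just tokens divided by wall-clock seconds and device count, and comparing the padded and unpadded totals shows how much compute goes to padding. An illustrative calculation with hypothetical numbers (not from a real run):

```python
# Hypothetical counters for one logging window; names mirror the logged metrics.
total_tokens = 4_000_000                    # non-padding tokens processed
total_tokens_including_padding = 5_000_000  # includes padding tokens
elapsed_seconds = 100.0
num_devices = 8

per_device_tps = total_tokens / (elapsed_seconds * num_devices)
per_device_tps_including_padding = (
    total_tokens_including_padding / (elapsed_seconds * num_devices)
)
padding_fraction = 1 - total_tokens / total_tokens_including_padding

print(per_device_tps)                    # 5000.0
print(per_device_tps_including_padding)  # 6250.0
print(f"{padding_fraction:.0%} of processed tokens were padding")  # 20%
```

A large gap between the two TPS numbers is a hint that `--packing` (or a shorter `--max_seq_length`) could recover wasted throughput.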
## Acknowledgements

We would like to thank the following projects for general infrastructure: