Supervised finetuning (SFT)

We support supervised finetuning (SFT) on a variety of datasets.

Implemented Variants

  • OLMo-core SFT (recommended): Uses OLMo-core's native training infrastructure. Most users should use this: it is more GPU-efficient than the legacy implementation and supports OLMo, Qwen, and other models. See open_instruct/olmo_core_utils.py for the current list of supported models.
  • finetune.py (legacy): Uses DeepSpeed/Accelerate. Use this only if you need a model architecture not yet supported by OLMo-core.

olmo_core_finetune.py (OLMo-core)

The recommended SFT implementation uses OLMo-core's native training infrastructure (FSDP/HSDP). It supports torch.compile, padding-free training, budget-mode activation checkpointing, and W&B tracking. It is launched via build_image_and_launch.sh, just like DPO and GRPO.

Debug Scripts

| Script | Scale | Description | Launch |
|---|---|---|---|
| `scripts/train/debug/oc_sft.sh` | 1 GPU, Beaker | Single-GPU test with Qwen3-0.6B. | `./scripts/train/build_image_and_launch.sh scripts/train/debug/oc_sft.sh` |
| `scripts/train/debug/oc_sft_multinode.sh` | 2 nodes (16 GPUs), Beaker | Multi-node test with Qwen3-0.6B; exercises HSDP sharding. | `./scripts/train/build_image_and_launch.sh scripts/train/debug/oc_sft_multinode.sh` |

Key Flags

| Group | Flag | Description | Default |
|---|---|---|---|
| Experiment | `--exp_name` | Name of this experiment | `"sft"` |
| | `--run_name` | Unique run name (for W&B) | `None` |
| | `--seed` | Random seed for initialization and dataset shuffling | `42` |
| Model | `--model_name_or_path` | Model checkpoint for weight initialization | |
| | `--config_name` | Pretrained config name or path, if different from the model | `None` |
| | `--attn_implementation` | Attention backend: `flash-2`, `flash-3`, `torch` (auto-detected if unset) | `None` |
| Training | `--learning_rate` | Initial learning rate | `8e-5` |
| | `--num_epochs` | Total number of training epochs | `3` |
| | `--max_train_steps` | If set, overrides `num_epochs` | `None` |
| | `--per_device_train_batch_size` | Batch size per GPU | `8` |
| | `--gradient_accumulation_steps` | Gradient accumulation steps | `1` |
| | `--max_seq_length` | Maximum sequence length after tokenization | `4096` |
| | `--warmup_ratio` | Linear warmup fraction of total steps | `0.03` |
| | `--weight_decay` | Weight decay for AdamW | `0.0` |
| | `--max_grad_norm` | Maximum gradient norm for clipping (`-1` = no clipping) | `-1` |
| | `--compile_model` | Apply `torch.compile` to model blocks | `True` |
| | `--activation_memory_budget` | Activation checkpointing budget (0.0–1.0); values < 1.0 enable budget-mode checkpointing | `1.0` |
| Data | `--mixer_list` | List of datasets (local or HF) to sample from | `allenai/tulu-3-sft-olmo-2-mixture` |
| | `--mixer_list_splits` | Dataset splits for training | `["train"]` |
| | `--transform_fn` | List of transform functions to apply to the dataset | `sft_tulu_tokenize_and_truncate_v1, sft_tulu_filter_v1` |
| | `--cache_dataset_only` | Exit after caching the dataset | `False` |
| | `--skip_cache` | Skip dataset caching | `False` |
| Checkpointing | `--output_dir` | Output directory for checkpoints | `output/` |
| | `--checkpointing_steps` | Save a persistent checkpoint every N steps | `500` |
| | `--ephemeral_save_interval` | Temporary checkpoint cadence (must be ≤ `checkpointing_steps`) | `500` |
| Logging | `--with_tracking` | Enable Weights & Biases tracking | `False` |
| | `--wandb_project` | W&B project name | `"open_instruct_internal"` |
| | `--wandb_entity` | W&B entity (team) | `None` |
| | `--logging_steps` | Log metrics every N steps | `1` |
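
For a custom run, these flags compose into a launch script in the same style as the debug scripts above. The following is a minimal single-node sketch, not a verbatim invocation: the `torchrun` launcher and the `open_instruct/olmo_core_finetune.py` path are assumptions (check `scripts/train/debug/oc_sft.sh` for the authoritative form), and the model choice is illustrative.

```bash
# Sketch of a custom single-node OLMo-core SFT launch.
# Assumptions: torchrun as the launcher and open_instruct/olmo_core_finetune.py
# as the entry point; mirror scripts/train/debug/oc_sft.sh for the real pattern.
torchrun --nproc-per-node 8 open_instruct/olmo_core_finetune.py \
    --exp_name qwen3_sft_demo \
    --model_name_or_path Qwen/Qwen3-0.6B \
    --learning_rate 8e-5 \
    --num_epochs 3 \
    --per_device_train_batch_size 8 \
    --max_seq_length 4096 \
    --mixer_list allenai/tulu-3-sft-olmo-2-mixture \
    --mixer_list_splits train \
    --output_dir output/qwen3_sft_demo \
    --checkpointing_steps 500 \
    --with_tracking
```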

Parallelism

The script automatically selects the data-parallel strategy based on the number of nodes:

  • Single node: FSDP (Fully Sharded Data Parallel)
  • Multiple nodes: HSDP (Hybrid Sharded Data Parallel) — shards within each node, replicates across nodes

See OLMo-core Sharding and Parallelism for more details on parallelism configuration.


finetune.py (Legacy)

This implementation has the following key features:

  • Automatically saves the trained checkpoint to the Hugging Face Hub
  • Supports LigerKernel for optimized training with fused operations

Debug Scripts

| Script | Scale | Launch |
|---|---|---|
| `scripts/train/debug/finetune.sh` | 1 GPU, local | `bash scripts/train/debug/finetune.sh` |
| `scripts/train/debug/sft_integration_test.sh` | 1 GPU, Beaker | `./scripts/train/build_image_and_launch.sh scripts/train/debug/sft_integration_test.sh` |
| `scripts/train/debug/sft_multinode_test.sh` | 2 nodes, Beaker | `./scripts/train/build_image_and_launch.sh scripts/train/debug/sft_multinode_test.sh` |

Key Flags

| Group | Flag | Description | Default |
|---|---|---|---|
| Model | `--model_name_or_path` | Model checkpoint for weight initialization | |
| | `--use_liger_kernel` | Use LigerKernel for optimized training | `False` |
| Training | `--learning_rate` | Initial learning rate | `2e-5` |
| | `--num_train_epochs` | Total number of training epochs | `2` |
| | `--per_device_train_batch_size` | Batch size per GPU | `8` |
| | `--gradient_accumulation_steps` | Gradient accumulation steps | `1` |
| | `--max_seq_length` | Maximum sequence length after tokenization | |
| | `--warmup_ratio` | Linear warmup fraction of total steps | `0.03` |
| | `--lr_scheduler_type` | LR scheduler: `linear`, `cosine`, etc. | `linear` |
| | `--gradient_checkpointing` | Use gradient checkpointing (saves memory) | `False` |
| | `--seed` | Random seed | `42` |
| Data | `--dataset_mixer_list` | List of datasets (local or HF) to sample from | |
| | `--dataset_mixer_list_splits` | Dataset splits for training | `["train"]` |
| | `--chat_template_name` | Chat template to use | `None` |
| | `--packing` | Use packing/padding-free collation | `False` |
| Parallelism | `--sequence_parallel_size` | Ulysses sequence parallelism degree | `1` |
| LoRA | `--use_lora` | Use LoRA for parameter-efficient training | `False` |
| | `--lora_rank` | LoRA rank | `64` |
| | `--lora_alpha` | LoRA alpha | `16` |
| Checkpointing | `--output_dir` | Output directory for checkpoints | `output/` |
| | `--checkpointing_steps` | Save a checkpoint every N steps, or `epoch` | |
| | `--resume_from_checkpoint` | Resume from a checkpoint folder | `None` |
| Logging | `--with_tracking` | Track the experiment with Weights & Biases | `False` |
| | `--logging_steps` | Log training loss every N steps | |
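
For a custom legacy run, the flags compose in the usual Accelerate pattern. The following is a minimal sketch, assuming an Accelerate + DeepSpeed launch; the DeepSpeed config path and model name are placeholders, and the debug scripts above show real invocations.

```bash
# Sketch of a custom legacy SFT launch via Accelerate + DeepSpeed.
# The DeepSpeed config path and model are placeholders; flags come
# from the table above.
accelerate launch \
    --num_processes 8 \
    --use_deepspeed \
    --deepspeed_config_file path/to/ds_config.conf \
    open_instruct/finetune.py \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --learning_rate 2e-5 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --max_seq_length 4096 \
    --dataset_mixer_list allenai/tulu-3-sft-mixture 1.0 \
    --dataset_mixer_list_splits train \
    --output_dir output/ \
    --with_tracking
```

Here `allenai/tulu-3-sft-mixture 1.0` follows the mixer convention of a dataset name paired with a sampling fraction.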

Reproduce allenai/Llama-3.1-Tulu-3-8B-SFT (8 Nodes)

You can reproduce our allenai/Llama-3.1-Tulu-3-8B-SFT model by running the following command:

```bash
bash scripts/train/tulu3/finetune_8b.sh
```

Info

If you are an external user, mason.py will print out the actual command being executed on our internal server, so you can modify the command as needed.


👉 Tracked WandB Experiments

Info

Based on our internal evaluation, the SFT model is roughly on par with the original allenai/Llama-3.1-Tulu-3-8B model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.

Info

We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!

Reproduce allenai/OLMo-2-1124-7B-SFT (8 Nodes)

You can reproduce our allenai/OLMo-2-1124-7B-SFT model by running the following command:

```bash
bash scripts/train/olmo2/finetune_7b.sh
```


👉 Tracked WandB Experiments

Info

Based on our internal evaluation, the SFT model is roughly on par with the original allenai/OLMo-2-1124-7B model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.


Reproduce allenai/OLMo-2-1124-13B-SFT (8 Nodes)

You can reproduce our allenai/OLMo-2-1124-13B-SFT model by running the following command:

```bash
bash scripts/train/olmo2/finetune_13b.sh
```


👉 Tracked WandB Experiments

Info

Based on our internal evaluation, the SFT model is roughly on par with the original allenai/OLMo-2-1124-13B model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.


Reproduce allenai/OLMo-2-1124-32B-SFT (8 Nodes)

You can reproduce our allenai/OLMo-2-1124-32B-SFT model by running the following command:

```bash
bash scripts/train/olmo2/finetune_32b.sh
```


👉 Tracked WandB Experiments

Info

Based on our internal evaluation, the SFT model is roughly on par with the original allenai/OLMo-2-1124-32B model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.


Training Metrics

During training, the following metrics are logged:

  • learning_rate: The current learning rate from the learning rate scheduler
  • train_loss: The average training loss over the logged steps
  • total_tokens: Total number of tokens processed (excluding padding)
  • per_device_tps: Tokens per second processed per device (excluding padding)
  • total_tokens_including_padding: Total number of tokens including padding tokens
  • per_device_tps_including_padding: Tokens per second processed per device (including padding)

The metrics are logged every logging_steps steps (if specified) and provide insight into:

  • Training progress (loss, learning rate)
  • Training efficiency (tokens per second)
  • Resource utilization (padding vs. non-padding tokens)
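
As a rough worked example (assuming per_device_tps is computed as non-padding tokens divided by device count and elapsed time, which the metric name suggests but the source does not spell out): a run that processes 4,194,304 non-padding tokens on 8 GPUs in 60 seconds would report per_device_tps ≈ 4,194,304 / (8 × 60) ≈ 8,738. A large gap between total_tokens and total_tokens_including_padding means a sizable share of compute is spent on padding; packing/padding-free collation narrows that gap.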

Acknowledgements

We would like to thank the following projects for general infrastructure: