Group Relative Policy Optimization (GRPO)
GRPO is an online RL method that first appeared in DeepSeekMath and was later used in the DeepSeek R1 paper.
Implemented Variants
- grpo.py is the recommended GRPO implementation, built on OLMo-core's native training infrastructure (FSDP). It uses Ray for distributed training with vLLM inference.
- grpo_fast.py is a faster variant that uses packing techniques with DeepSpeed.
grpo.py (OLMo-core)
Debug Scripts
| Script | Scale | Launch |
|---|---|---|
| scripts/train/debug/single_gpu_grpo.sh | 1 GPU, Beaker | ./scripts/train/build_image_and_launch.sh scripts/train/debug/single_gpu_grpo.sh |
| scripts/train/debug/multi_node_grpo.sh | 2 nodes (16 GPUs), Beaker | ./scripts/train/build_image_and_launch.sh scripts/train/debug/multi_node_grpo.sh |
| scripts/train/debug/tools/olmo_3_parser_multigpu.sh | 2 nodes, tools, Beaker | ./scripts/train/build_image_and_launch.sh scripts/train/debug/tools/olmo_3_parser_multigpu.sh |
Olmo 3 Scripts
| Script | Scale | Description | Launch |
|---|---|---|---|
| scripts/train/olmo3/7b_instruct_rl.sh | 8 nodes (64 GPUs) | Olmo 3 7B Instruct GRPO with multi-task reasoning datasets | ./scripts/train/build_image_and_launch.sh scripts/train/olmo3/7b_instruct_rl.sh |
| scripts/train/olmo3/7b_think_rl.sh | 4 nodes (32 GPUs) | Olmo 3 7B Think GRPO with pipeline RL | ./scripts/train/build_image_and_launch.sh scripts/train/olmo3/7b_think_rl.sh |
| scripts/train/olmo3/32b_instruct_rl.sh | 12 nodes (96 GPUs) | Olmo 3 32B Instruct GRPO | ./scripts/train/build_image_and_launch.sh scripts/train/olmo3/32b_instruct_rl.sh |
| scripts/train/olmo3/32b_think_rl.sh | 28 nodes (224 GPUs) | Olmo 3 32B Think GRPO | ./scripts/train/build_image_and_launch.sh scripts/train/olmo3/32b_think_rl.sh |
| scripts/train/olmo3/7b_rlzero_math.sh | 9 nodes (72 GPUs) | Olmo 3 7B RL-Zero for math | ./scripts/train/build_image_and_launch.sh scripts/train/olmo3/7b_rlzero_math.sh |
| scripts/train/olmo3/7b_rlzero_code.sh | 5 nodes (40 GPUs) | Olmo 3 7B RL-Zero for code | ./scripts/train/build_image_and_launch.sh scripts/train/olmo3/7b_rlzero_code.sh |
| scripts/train/olmo3/7b_rlzero_general.sh | 5 nodes (40 GPUs) | Olmo 3 7B RL-Zero for general tasks | ./scripts/train/build_image_and_launch.sh scripts/train/olmo3/7b_rlzero_general.sh |
| scripts/train/olmo3/7b_rlzero_instruction_following.sh | 5 nodes (40 GPUs) | Olmo 3 7B RL-Zero for instruction following | ./scripts/train/build_image_and_launch.sh scripts/train/olmo3/7b_rlzero_instruction_following.sh |
| scripts/train/olmo3/7b_rlzero_mix.sh | 4 nodes (32 GPUs) | Olmo 3 7B RL-Zero mixed (code, IF, general) | ./scripts/train/build_image_and_launch.sh scripts/train/olmo3/7b_rlzero_mix.sh |
Key Flags
Both grpo.py and grpo_fast.py share the same config classes and accept the same flags.
| Group | Flag | Description | Default |
|---|---|---|---|
| Training | --learning_rate | Initial learning rate | 2e-5 |
| | --lr_scheduler_type | LR scheduler: linear, cosine, etc. | linear |
| | --per_device_train_batch_size | Forward batch size per device | 1 |
| | --total_episodes | Total number of episodes in dataset | 100000 |
| | --num_epochs | Number of epochs to train | 1 |
| | --num_mini_batches | Mini-batches to split a batch into | 1 |
| | --seed | Random seed | 1 |
| GRPO Algorithm | --beta | KL coefficient for RLHF objective | 0.05 |
| | --clip_lower | Lower clip range | 0.2 |
| | --clip_higher | Higher clip range (see DAPO) | 0.2 |
| | --loss_fn | Loss function: dapo or cispo | dapo |
| | --load_ref_policy | Load and use reference policy for KL | True |
| Rollout / Sampling | --num_unique_prompts_rollout | Unique prompts per rollout | 16 |
| | --num_samples_per_prompt_rollout | Samples per prompt in rollout | 4 |
| | --temperature | Sampling temperature | 0.7 |
| | --max_prompt_token_length | Max tokens for prompts | 256 |
| | --response_length | Token length for responses | 256 |
| | --pack_length | Total pack length for packing | 512 |
| | --async_steps | Number of async generation steps | 1 |
| | --active_sampling | Enable active sampling | False |
| Reward | --apply_verifiable_reward | Apply verifiable reward | True |
| | --verification_reward | Verification reward value | 10.0 |
| | --apply_r1_style_format_reward | Apply R1-style format reward | False |
| Infrastructure | --deepspeed_stage | DeepSpeed stage (0, 2, or 3) | 0 |
| | --sequence_parallel_size | Sequence parallel size across GPUs | 1 |
| | --num_learners_per_node | GPUs per node for training | [1] |
| | --single_gpu_mode | Collocate vLLM and actor on same node | False |
| vLLM | --vllm_num_engines | Number of vLLM engines | 1 |
| | --vllm_tensor_parallel_size | Tensor parallelism size | 1 |
| | --vllm_gpu_memory_utilization | GPU memory utilization ratio | 0.9 |
| Model | --model_name_or_path | Model checkpoint for weight initialization | - |
| | --gradient_checkpointing | Use gradient checkpointing | False |
| | --chat_template_name | Chat template to use | None |
| Saving | --output_dir | Output directory for checkpoints | output |
| | --save_freq | Save every N train steps | 200 |
| | --with_tracking | Track experiment with Weights and Biases | False |
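To make the GRPO Algorithm flags more concrete, here is a minimal sketch of how --clip_lower and --clip_higher could enter a per-token clipped objective. It assumes the standard PPO-style clipped surrogate with an asymmetric clip range, as popularized by DAPO. The helper name clipped_policy_loss and its signature are hypothetical and are not taken from grpo.py or grpo_fast.py.

```python
import math


def clipped_policy_loss(new_logprob, old_logprob, advantage,
                        clip_lower=0.2, clip_higher=0.2):
    """Per-token clipped surrogate loss (hypothetical helper, not the repo's code)."""
    # Probability ratio between the current policy and the policy that sampled the token.
    ratio = math.exp(new_logprob - old_logprob)
    # Asymmetric clip range: [1 - clip_lower, 1 + clip_higher].
    clipped_ratio = min(max(ratio, 1.0 - clip_lower), 1.0 + clip_higher)
    # Pessimistic bound: keep whichever of the two candidate losses is larger.
    return max(-advantage * ratio, -advantage * clipped_ratio)


# A token whose probability grew by 1.5x with a positive advantage is clipped at 1 + clip_higher.
log_ratio = math.log(1.5)
print(clipped_policy_loss(log_ratio, 0.0, advantage=1.0))                    # about -1.2
print(clipped_policy_loss(log_ratio, 0.0, advantage=1.0, clip_higher=0.28))  # about -1.28
```

Raising clip_higher above clip_lower lets tokens with positive advantage move further before the update is clipped, which is the motivation usually given for the DAPO-style asymmetric range.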
grpo_fast.py
This implementation has the following features:
- Uses packing techniques to speed up the training process, inspired by Open-Reasoner-Zero/Open-Reasoner-Zero
- Uses a thread-based approach to parallelize the training and inference processes, based on Asynchronous RLHF.
- Uses a data preparation thread to prepare the data for the training process.
On simpler tasks, we see 2x faster training, and even 10x faster training on more complex tasks. With grpo_fast.py, we can crank up the number of samples per prompt and train on very large batch sizes.
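As a rough illustration of why packing reduces the number of forward passes, the sketch below greedily places variable-length sequences into packs of at most pack_length tokens. The first-fit strategy and the helper name pack_sequences are assumptions made for illustration; grpo_fast.py may pack sequences differently.

```python
def pack_sequences(lengths, pack_length=512):
    """Greedy first-fit packing of sequence lengths (hypothetical helper)."""
    packs = []  # each pack is a list of the sequence lengths it contains
    for length in lengths:
        if length > pack_length:
            raise ValueError(f"sequence of {length} tokens exceeds pack_length={pack_length}")
        for pack in packs:
            # Put the sequence into the first pack that still has room.
            if sum(pack) + length <= pack_length:
                pack.append(length)
                break
        else:
            # No existing pack had room, so open a new one.
            packs.append([length])
    return packs


lengths = [100, 400, 50, 300, 200, 60]
packs = pack_sequences(lengths)
print(packs)                      # [[100, 400], [50, 300, 60], [200]]
print(len(packs) / len(lengths))  # 0.5: three forward passes instead of six
```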
It implements additional optimizations:
grpo_fast.py also implements an optimization to skip zero-gradient batches. If every response to a prompt is correct (100%) or every response is wrong (0%), the std of the group's scores is 0. So adv = (score - score.mean()) / (score.std + 1e-5) = 0 / 1e-5 = 0, causing zero gradients. grpo_fast.py skips these batches before packing the sequences.

Figure taken from this discord thread by @the_real_jrb
grpo_fast.py only applies the verification reward if the format reward is enabled (via --additive_format_reward False by default). See allenai/open-instruct/pull/659. A direct additive format reward is undesirable. In GRPO, the scale of the rewards is not relevant due to group normalization. For example, groups with rewards [0, 0, 0, 0, 10], [0, 0, 0, 0, 11], and [0, 0, 0, 0, 1] all yield the same advantages.
Now imagine a case where the model generates a really long response (8k generation length) but only gets the format reward right. GRPO will push up the probabilities of this long response even though the response is not actually correct. As a result, when using the format reward directly, we see the response length of unsolved prompts fluctuate significantly, causing stability issues.
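The group normalization argument can be checked numerically. The sketch below applies the adv formula quoted above to the three example groups and to a group where every response gets the same score. It assumes the population standard deviation (statistics.pstdev); the helper name group_advantages is hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(scores, eps=1e-5):
    """adv = (score - mean) / (std + eps), computed within one prompt's group."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population std; a sample std rescales all groups alike
    return [(s - mu) / (sigma + eps) for s in scores]


# The reward scale differs by an order of magnitude, yet the advantages match
# (up to the small eps term): roughly [-0.5, -0.5, -0.5, -0.5, 2.0] for each group.
for rewards in ([0, 0, 0, 0, 10], [0, 0, 0, 0, 11], [0, 0, 0, 0, 1]):
    print([round(a, 3) for a in group_advantages(rewards)])

# A fully solved (or fully failed) group has zero std, so every advantage is 0.
print(group_advantages([10, 10, 10, 10, 10]))
```

Because only the within-group ranking survives normalization, a small format-only reward in an otherwise unsolved group receives the same positive advantage that a correct answer would, which is the failure mode described above.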

Debug Scripts
| Script | Scale | Launch |
|---|---|---|
| scripts/train/debug/grpo_fast.sh | 1 GPU, local | bash scripts/train/debug/grpo_fast.sh |
| scripts/train/debug/grpo_fast_3_gpu.sh | 3 GPUs (2 train, 1 inference), local | bash scripts/train/debug/grpo_fast_3_gpu.sh |
| scripts/train/debug/grpo_integration_test.sh | 1 GPU, Beaker | ./scripts/train/build_image_and_launch.sh scripts/train/debug/grpo_integration_test.sh |
grpo_fast.py accepts the same flags as grpo.py. See the Key Flags table above.
Reproduce allenai/Llama-3.1-Tulu-3.1-8B (1 Node)
You can reproduce our allenai/Llama-3.1-Tulu-3.1-8B model by running the following command:
bash scripts/train/tulu3/grpo_fast_8b_single_node.sh
Info
Here grpo_fast.py uses 6 GPUs for training and 2 GPUs for inference, so it uses less hardware yet runs faster than the legacy grpo_vllm_thread_ray_gtrl.py, which used 2 nodes (12 GPUs for training and 4 GPUs for inference).

Tracked WandB Experiments
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

Info
Based on our internal evaluation, the GRPO model is roughly on par with the original allenai/Llama-3.1-Tulu-3.1-8B model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.

Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
(🧪 Experimental) Qwen 2.5 7B GRPO Fast Zero-style
We have a launch script for this setup:
bash scripts/train/qwen/grpo_fast_7b.sh

Tracked WandB Experiments
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
(🧪 Experimental) Olmo2 7B GRPO Fast Zero-style
We have a launch script for this setup:
bash scripts/train/olmo2/grpo_fast_7b_zero.sh

Tracked WandB Experiments
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
(🧪 Experimental) Olmo2 13B GRPO Fast Zero-style
We have a launch script for this setup:
bash scripts/train/olmo2/grpo_fast_13b_zero.sh

Tracked WandB Experiments
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
Training Metrics
grpo_fast.py includes the following additional metrics beyond the standard training metrics:
- other/real_batch_size_ratio: In GRPO, the effective batch size shrinks as training progresses. If every response to a prompt is correct (100%) or every response is wrong (0%), the std of the group is 0. So adv = (score - score.mean()) / (score.std + 1e-5) = 0 / 1e-5 = 0, causing zero gradients. This metric is the ratio of the samples that have gradients to the total number of samples.
- other/packed_ratio: The ratio of the number of packed sequences to the total number of sequences. The lower the ratio, the more efficiently the sequences are packed. E.g., if we have 100 sequences and the ratio is 0.1, we only need 10% of the forward passes that would be required without packing.
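A minimal sketch of how other/real_batch_size_ratio could be computed from per-prompt score groups is shown below. It assumes a group contributes gradients exactly when its scores are not all identical; the function name real_batch_size_ratio and the input layout (one list of scores per prompt) are hypothetical.

```python
from statistics import pstdev


def real_batch_size_ratio(score_groups):
    """Fraction of samples that belong to a group with non-zero score std."""
    total = sum(len(group) for group in score_groups)
    with_gradient = sum(len(group) for group in score_groups if pstdev(group) > 0)
    return with_gradient / total


groups = [
    [10, 10, 10, 10],  # solved every time: zero std, no gradient
    [0, 0, 0, 0],      # never solved: zero std, no gradient
    [0, 10, 0, 10],    # mixed outcomes: contributes gradients
    [0, 0, 0, 10],     # mixed outcomes: contributes gradients
]
print(real_batch_size_ratio(groups))  # 0.5: 8 of the 16 samples carry gradients
```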
Reproduce allenai/Llama-3.1-Tulu-3.1-8B (2 Nodes)
These results were produced with the legacy grpo_vllm_thread_ray_gtrl.py script, which has since been removed. The experiments were run at commit 745bf58d321c. They are preserved here for historical reference. See the original launch script for the launch command.

Tracked WandB Experiments
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

Info
Based on our internal evaluation, the GRPO model is roughly on par with the original allenai/Llama-3.1-Tulu-3.1-8B model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.

Reproduce allenai/OLMo-2-1124-7B-Instruct but better (2 Nodes)
These results were produced with the legacy grpo_vllm_thread_ray_gtrl.py, which has since been removed. See the deleted script for the original launch command.

Tracked WandB Experiments
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

Info
Based on our internal evaluation, the GRPO model actually outperforms the original allenai/OLMo-2-1124-7B-Instruct model. This is mostly because the original allenai/OLMo-2-1124-7B-Instruct was trained with PPO, which may suffer from not using an outcome reward model to initialize the value model (it uses a general RM to initialize the value model instead). Note that your results may vary slightly due to the random seeds used in the training.

(🧪 Experimental) Qwen 2.5 7B Zero-style
These results were produced with the legacy grpo_vllm_thread_ray_gtrl.py, which has since been removed. See the deleted script for the original launch command. Training was done on ai2-adapt-dev/math_ground_truth_zs starting from a base model, similar to DeepSeek R1.

Tracked WandB Experiments
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
Training Metrics
During training, the following metrics are logged:
- episode: the global episode number training has gone through (e.g., 3000 means we have trained on 3000 data points already; in the case of RLVR these are prompts, which can repeat)
- lr: the current learning rate
- epoch: the fraction or multiple of the epoch (e.g., 2.7 means we have trained on the dataset for 2 epochs and 70% of the third epoch)
- objective/kl: the KL divergence between the current policy and the reference policy (sum of the KL divergence of each response token)
- objective/scores: the scores of the current response, rated by a combination of reward model and other rewards (e.g., R1 style format reward, verifiable reward, etc.)
- objective/rlhf_reward: the RLHF reward, which is objective/scores - beta * objective/kl
- objective/non_score_reward: beta * objective/kl
- objective/entropy: the entropy of the current policy
- objective/loss: the GRPO loss
- objective/kl2: the second variant of KL divergence used in the training process, calculated similarly to objective/kl
- objective/kl3: the third variant of KL divergence used in the training process, providing additional insights into policy divergence
- objective/scores_mean: the mean of the scores of the current response, providing an average measure of response quality
- objective/reward_std: the standard deviation of the rewards, indicating the variability in the reward distribution
- objective/verifiable_correct_rate: the rate at which responses are verifiably correct, providing a measure of response accuracy
- loss/policy_avg: the average policy loss, indicating the mean loss incurred during policy updates
- policy/approxkl_avg: the average approximate KL divergence, used to monitor policy stability
- policy/clipfrac_avg: the average fraction of updates where the policy was clipped, indicating how often clipping occurs
- policy/entropy_avg: the average entropy of the policy, providing a measure of policy randomness
- time/from_scratch: the time taken to train the model from scratch
- time/training: the time taken to do one training step
- val/sequence_lengths: the length of the sequences in the generated responses
- val/num_stop_token_ids: the number of stop tokens in the generated responses
- val/ratio: the mean ratio of the new policy to the old policy, used to assess policy updates
- val/ratio_var: the variance of the ratio of the new policy to the old policy, indicating the variability in policy updates
- val/stop_token_rate: the rate at which stop tokens appear in the responses, providing a measure of response termination
- val/format_scores: the mean format scores, indicating the quality of response formatting (only logged if add_r1_style_format_reward is enabled)
- other/real_batch_size_ratio: In GRPO, the effective batch size shrinks as training progresses. If every response to a prompt is correct (100%) or every response is wrong (0%), the std of the group is 0. So adv = (score - score.mean()) / (score.std + 1e-5) = 0 / 1e-5 = 0, causing zero gradients. This metric is the ratio of the samples that have gradients to the total number of samples.
- other/packed_ratio: The ratio of the number of packed sequences to the total number of sequences. The lower the ratio, the more efficiently the sequences are packed. E.g., if we have 100 sequences and the ratio is 0.1, we only need 10% of the forward passes that would be required without packing.
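The relationship between the KL metrics and objective/rlhf_reward can be sketched as follows. The list above only describes objective/kl2 and objective/kl3 as variants, so this example assumes they correspond to the commonly used k1, k2, and k3 sample-based KL estimators; that mapping, and the helper names kl_estimators and rlhf_reward, are assumptions rather than the script's actual code.

```python
import math


def kl_estimators(logprobs, ref_logprobs):
    """Sum per-token KL estimates over one response (assumed k1/k2/k3 forms)."""
    kl1 = kl2 = kl3 = 0.0
    for lp, ref_lp in zip(logprobs, ref_logprobs):
        diff = lp - ref_lp                 # log(pi / pi_ref) for the sampled token
        kl1 += diff                        # k1: unbiased, can be negative per token
        kl2 += 0.5 * diff ** 2             # k2: always non-negative, biased
        kl3 += math.expm1(-diff) + diff    # k3: (r - 1) - log r with r = pi_ref / pi
    return kl1, kl2, kl3


def rlhf_reward(score, kl, beta=0.05):
    """objective/rlhf_reward = objective/scores - beta * objective/kl."""
    return score - beta * kl


# Hypothetical per-token log-probabilities for a two-token response.
kl1, kl2, kl3 = kl_estimators([-1.0, -2.0], [-1.5, -1.5])
print(kl1, kl2, kl3)
print(rlhf_reward(score=10.0, kl=kl2))
```

In this toy input the k1 terms cancel to zero while k2 and k3 stay positive, which is one reason several estimators are often logged side by side.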
Acknowledgements
We would like to thank the following resources for GRPO theory:
We would like to thank the following resources for GRPO implementation and Ray usage:
We would like to thank the following projects for general infrastructure: