Group Relative Policy Optimization (GRPO)
GRPO is an online RL method used in the DeepSeek R1 paper; it first appeared in DeepSeekMath.
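At its core, GRPO samples a group of responses for each prompt and uses rewards normalized within that group as advantages, so no separate value model is needed. Below is a minimal sketch of that group-relative advantage computation (an illustration only, not the exact code used by the scripts in this repo):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Sketch of GRPO's group-relative advantage.

    rewards has shape (num_prompts, num_samples_per_prompt): each row holds the
    scalar rewards of the responses sampled from the same prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Each response is judged relative to its own group, replacing a learned value function.
    return (rewards - mean) / (std + eps)

# Example: 2 prompts x 4 samples per prompt with 0/1 verifiable rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
```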
Implemented Variants
- grpo_fast.py is a faster variant that uses packing techniques.
- grpo_vllm_thread_ray_gtrl.py is a more vanilla GRPO implementation, using vLLM and Ray.
grpo_fast.py
This implementation has the following features:
- Uses packing techniques to speed up the training process, inspired by Open-Reasoner-Zero/Open-Reasoner-Zero
- Uses a thread-based approach to parallelize the training and inference processes, based on Asynchronous RLHF.
- Uses a data preparation thread to prepare the data for the training process.
On simpler tasks we see 2x faster training, and up to 10x faster on more complex tasks. With grpo_fast.py, we can crank up number_samples_per_prompt and train on really large batch sizes.
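To give a feel for the packing trick listed above, here is a hypothetical helper (not the repository's implementation): instead of padding every response to the longest one in the batch, several variable-length sequences are concatenated into fixed-length rows, so less compute is wasted on padding tokens.

```python
from typing import List

def pack_sequences(sequences: List[List[int]], max_len: int, pad_id: int = 0) -> List[List[int]]:
    """Greedily pack token sequences into rows of at most max_len tokens.

    Assumes each individual sequence fits within max_len.
    """
    rows, current = [], []
    for seq in sequences:
        if current and len(current) + len(seq) > max_len:
            rows.append(current)
            current = []
        current = current + seq
    if current:
        rows.append(current)
    # Only the tail of each packed row is padded, instead of padding every sequence.
    return [row + [pad_id] * (max_len - len(row)) for row in rows]

# Example: three responses of uneven length packed into rows of 8 tokens.
print(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], max_len=8))
```

In practice, packing also requires adjusting attention masks and position IDs so that sequences packed into the same row do not attend to each other.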
Debug (Single GPU)
You can run the script in a single GPU mode to debug the training process.
bash scripts/train/debug/grpo_fast_mini.sh
Reproduce allenai/Llama-3.1-Tulu-3.1-8B (1 Node)
You can reproduce our allenai/Llama-3.1-Tulu-3.1-8B
model by running the following command:
bash scripts/train/tulu3/grpo_fast_8b_single_node.sh
Info
Here grpo_fast.py actually uses 6 GPUs for training and 2 GPUs for inference, so it uses less hardware but runs faster than grpo_vllm_thread_ray_gtrl.py, which uses 2 nodes (12 GPUs for training and 4 GPUs for inference).
Tracked WandB Experiments (Click to expand)
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.
Info
Based on our internal evaluation, the GRPO model is roughly on par with the original allenai/Llama-3.1-Tulu-3.1-8B
model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
(🧪 Experimental) Qwen 2.5 7B GRPO Fast Zero-style
We have the following script:
bash scripts/train/qwen/grpo_fast_7b.sh
Tracked WandB Experiments (Click to expand)
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
(🧪 Experimental) Olmo2 7B GRPO Fast Zero-style
We have the following script:
bash scripts/train/olmo2/grpo_fast_7b_zero.sh
Tracked WandB Experiments (Click to expand)
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
(🧪 Experimental) Olmo2 13B GRPO Fast Zero-style
We have the following script:
bash scripts/train/olmo2/grpo_fast_13b_zero.sh
Tracked WandB Experiments (Click to expand)
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
grpo_vllm_thread_ray_gtrl.py
This implementation has the following features:
- Uses a thread-based approach to parallelize the training and inference processes, based on Asynchronous RLHF.
- Uses vLLM and Ray to parallelize the training process, based on how OpenRLHF does it (see the sketch below).
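Conceptually, generation runs in its own thread and feeds a queue that the training loop consumes, so new rollouts are produced while the policy is being updated. Here is a rough sketch of that producer/consumer pattern (simplified; the actual script generates with vLLM and coordinates the workers through Ray):

```python
import queue
import threading

rollout_queue: queue.Queue = queue.Queue(maxsize=2)

def generation_thread(prompt_batches, generate_fn):
    """Producer: generate rollouts (with vLLM in the real script) and enqueue them."""
    for prompts in prompt_batches:
        rollout_queue.put(generate_fn(prompts))
    rollout_queue.put(None)  # signal that generation is finished

def training_loop(train_step_fn):
    """Consumer: run a GRPO update as soon as a batch of rollouts is available."""
    while True:
        rollouts = rollout_queue.get()
        if rollouts is None:
            break
        train_step_fn(rollouts)

# Example wiring with dummy generate/train functions:
prompt_batches = [["What is 1+1?"], ["What is 2+2?"]]
producer = threading.Thread(
    target=generation_thread,
    args=(prompt_batches, lambda prompts: [(p, "a sampled response") for p in prompts]),
)
producer.start()
training_loop(lambda batch: print("training on", batch))
producer.join()
```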
Debug (Single GPU)
You can run the script in a single GPU mode to debug the training process.
bash scripts/train/debug/grpo.sh
Reproduce allenai/Llama-3.1-Tulu-3.1-8B (2 Nodes)
You can reproduce our allenai/Llama-3.1-Tulu-3.1-8B
model by running the following command:
bash scripts/train/tulu3/grpo_8b.sh
Tracked WandB Experiments (Click to expand)
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.
Info
Based on our internal evaluation, the GRPO model is roughly on par with the original allenai/Llama-3.1-Tulu-3.1-8B
model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.
Reproduce allenai/OLMo-2-1124-7B-Instruct but better (2 Nodes)
You can reproduce our allenai/OLMo-2-1124-7B-Instruct
model by running the following command:
bash scripts/train/olmo2/grpo_7b.sh
Tracked WandB Experiments (Click to expand)
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.
Info
Based on our internal evaluation, the GRPO model actually outperforms the original allenai/OLMo-2-1124-7B-Instruct model. This is mostly because the original allenai/OLMo-2-1124-7B-Instruct was trained with PPO, which may suffer from not initializing the value model from an outcome reward model (it uses a general RM to initialize the value model instead). Note that your results may vary slightly due to the random seeds used in training.
(🧪 Experimental) Qwen 2.5 7B Zero-style
Here is a command to run GRPO on Qwen/Qwen2.5-7B with ai2-adapt-dev/math_ground_truth_zs, which is simply a zero-shot version of the RLVR MATH dataset. Training starts from the base model, similar to how DeepSeek R1 does it.
bash scripts/train/qwen/grpo_7b.sh
Tracked WandB Experiments (Click to expand)
Info
Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.
Info
We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!
Training Metrics
During training, the following metrics are logged:
- episode: the global episode number training has gone through (e.g., 3000 means we have trained on 3000 data points already -- in the case of RLVR that is prompts, which can repeat)
- lr: the current learning rate
- epoch: the fraction or multiple of the epoch (e.g., 2.7 means we have trained on the dataset for 2 epochs and 70% of the third epoch)
- objective/kl: the KL divergence between the current policy and the reference policy (sum of the KL divergence of each response token)
- objective/scores: the scores of the current response, rated by a combination of reward model and other rewards (e.g., R1-style format reward, verifiable reward, etc.)
- objective/rlhf_reward: the RLHF reward, which is objective/scores - beta * objective/kl
- objective/non_score_reward: beta * objective/kl
- objective/entropy: the entropy of the current policy
- objective/loss: the GRPO loss
- objective/kl2: the second variant of KL divergence used in the training process, calculated similarly to objective/kl
- objective/kl3: the third variant of KL divergence used in the training process, providing additional insights into policy divergence
- objective/scores_mean: the mean of the scores of the current response, providing an average measure of response quality
- objective/reward_std: the standard deviation of the rewards, indicating the variability in the reward distribution
- objective/verifiable_correct_rate: the rate at which responses are verifiably correct, providing a measure of response accuracy
- loss/policy_avg: the average policy loss, indicating the mean loss incurred during policy updates
- policy/approxkl_avg: the average approximate KL divergence, used to monitor policy stability
- policy/clipfrac_avg: the average fraction of updates where the policy was clipped, indicating how often clipping occurs
- policy/entropy_avg: the average entropy of the policy, providing a measure of policy randomness
- time/from_scratch: the time taken to train the model from scratch
- time/training: the time taken to do one training step
- val/sequence_lengths: the length of the sequences in the generated responses
- val/num_stop_token_ids: the number of stop tokens in the generated responses
- val/ratio: the mean ratio of the new policy to the old policy, used to assess policy updates
- val/ratio_var: the variance of the ratio of the new policy to the old policy, indicating the variability in policy updates
- val/stop_token_rate: the rate at which stop tokens appear in the responses, providing a measure of response termination
- val/format_scores: the mean format scores, indicating the quality of response formatting (only logged if add_r1_style_format_reward is enabled)
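As a concrete reference for how a few of these metrics relate, here is an illustrative computation (a sketch assuming per-response-token log probabilities from the policy and reference model; it mirrors the descriptions above rather than the exact training code):

```python
import torch

def kl_and_reward_metrics(logprobs, ref_logprobs, scores, beta):
    """Illustrative versions of objective/kl, objective/kl2, objective/kl3, and objective/rlhf_reward.

    logprobs / ref_logprobs: (batch, response_len) per-token log probabilities.
    scores: (batch,) scalar score per response (reward model + format/verifiable rewards).
    """
    logr = ref_logprobs - logprobs              # log(pi_ref / pi) for each response token
    kl = (-logr).sum(dim=-1)                    # objective/kl: summed over response tokens
    kl2 = (logr ** 2 / 2).sum(dim=-1)           # objective/kl2: squared-difference variant
    kl3 = (logr.exp() - 1 - logr).sum(dim=-1)   # objective/kl3: always non-negative variant
    rlhf_reward = scores - beta * kl            # objective/rlhf_reward = objective/scores - beta * objective/kl
    return kl, kl2, kl3, rlhf_reward
```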
Acknowledgements
We would like to thank the following resources for GRPO theory:
We would like to thank the following resources for GRPO implementation and Ray usage:
We would like to thank the following projects for general infrastructure: