
Grouped Relative Policy Optimization (GRPO)

GRPO is an online RL method used in the DeepSeek R1 paper; it first appeared in DeepSeekMath.

Implemented Variants

  • grpo_fast.py is a faster variant using packing techniques.
  • grpo_vllm_thread_ray_gtrl.py is a more vanilla GRPO implementation, using vLLM and Ray.

grpo_fast.py

This implementation has the following features:

  • Uses packing techniques to speed up the training process, inspired by Open-Reasoner-Zero/Open-Reasoner-Zero (a minimal packing sketch follows this list).
  • Uses a thread-based approach to parallelize the training and inference processes, based on Asynchronous RLHF.
  • Uses a data preparation thread to prepare the data for the training process.
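
To make the packing idea concrete, below is a minimal sketch under assumed names (pack_sequences, pack_length, segment_ids are illustrative, not the actual grpo_fast.py internals): variable-length tokenized sequences are greedily packed into fixed-length rows, and a per-token segment id records which sequence each token belongs to, so attention never crosses sequence boundaries.

    # Minimal packing sketch; names and shapes are illustrative assumptions.
    from typing import Dict, List

    import torch

    def pack_sequences(sequences: List[List[int]], pack_length: int, pad_id: int) -> List[Dict[str, torch.Tensor]]:
        # Greedily fill each row up to pack_length tokens (sequences longer than
        # pack_length would need truncation, omitted here for brevity).
        rows, cur_ids, cur_seg, seg = [], [], [], 0
        for seq in sequences:
            if cur_ids and len(cur_ids) + len(seq) > pack_length:
                rows.append((cur_ids, cur_seg))
                cur_ids, cur_seg, seg = [], [], 0
            cur_ids.extend(seq)
            cur_seg.extend([seg] * len(seq))  # segment id marks which sequence a token came from
            seg += 1
        if cur_ids:
            rows.append((cur_ids, cur_seg))

        packed = []
        for ids, segs in rows:
            pad = pack_length - len(ids)
            packed.append({
                "input_ids": torch.tensor(ids + [pad_id] * pad),
                # attention is restricted to tokens sharing a segment id, so packed
                # sequences never attend to each other
                "segment_ids": torch.tensor(segs + [-1] * pad),
            })
        return packed

Because each packed row is nearly full, far fewer padding tokens are wasted per forward pass, which is where most of the speedup comes from.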

On simpler tasks we see about 2x faster training, and up to 10x faster on more complex tasks. With grpo_fast.py, we can crank up number_samples_per_prompt and train with very large batch sizes.

Debug (Single GPU)

You can run the script in a single GPU mode to debug the training process.

bash scripts/train/debug/grpo_fast_mini.sh

Reproduce allenai/Llama-3.1-Tulu-3.1-8B (1 Node)

You can reproduce our allenai/Llama-3.1-Tulu-3.1-8B model by running the following command:

bash scripts/train/tulu3/grpo_fast_8b_single_node.sh
Info

Here grpo_fast.py actually uses 6 GPUs for training and 2 GPUs for inference, so it uses less hardware yet runs faster than grpo_vllm_thread_ray_gtrl.py, which uses 2 nodes (12 GPUs for training and 4 GPUs for inference).

[Figures: grpo_tulu3_8b, grpo_tulu3_8b_time]

👉 Tracked WandB Experiments

Info

Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

[Figure: grpo_plot]

Info

Based on our internal evaluation, the GRPO model is roughly on par with the original allenai/Llama-3.1-Tulu-3.1-8B model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.

[Figure: grpo_plot]

Info

We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!

(🧪 Experimental) Qwen 2.5 7B GRPO Fast Zero-style

We have the following command:

bash scripts/train/qwen/grpo_fast_7b.sh

[Figures: grpo_qwen2.5_7B_works, grpo_qwen2.5_7B_works_time]

👉 Tracked WandB Experiments

Info

Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

[Figure: grpo_plot]

Info

We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!

(🧪 Experimental) Olmo2 7B GRPO Fast Zero-style

We have the following command:

bash scripts/train/olmo2/grpo_fast_7b_zero.sh

[Figures: grpo_olmo2_7b_zero, grpo_olmo2_7b_zero_time]

👉 Tracked WandB Experiments

Info

Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

[Figure: grpo_plot]

Info

We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!

(🧪 Experimental) Olmo2 13B GRPO Fast Zero-style

We have the following command:

bash scripts/train/olmo2/grpo_fast_13b_zero.sh

[Figures: grpo_olmo2_13b_zero, grpo_olmo2_13b_zero_time]

👉 Tracked WandB Experiments

Info

Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

[Figure: grpo_plot]

Info

We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!

grpo_vllm_thread_ray_gtrl.py

This implementation has the following features:

  • Uses a thread-based approach to parallelize the training and inference processes, based on Asynchronous RLHF (a simplified sketch follows this list).
  • Uses vLLM and Ray to parallelize the training process, based on how OpenRLHF does it.
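
Below is a simplified, self-contained sketch of the thread-based overlap between generation and training; generate_rollouts and train_step are stand-ins for the real vLLM and trainer calls, not the actual grpo_vllm_thread_ray_gtrl.py API. A producer thread generates rollouts for the next batch while the main thread trains on the current one, with a bounded queue in between.

    # Simplified sketch of asynchronous rollout generation; generate_rollouts()
    # and train_step() are placeholders for the real vLLM / trainer calls.
    import queue
    import threading
    import time

    def generate_rollouts(prompts):
        time.sleep(0.1)  # stand-in for a vLLM generation call on the inference GPUs
        return [p + " ... generated response" for p in prompts]

    def train_step(rollouts):
        time.sleep(0.1)  # stand-in for a policy update on the training GPUs
        print(f"trained on {len(rollouts)} rollouts")

    def generation_thread(prompt_batches, rollout_queue):
        # Producer: stays at most one batch ahead of the trainer.
        for prompts in prompt_batches:
            rollout_queue.put(generate_rollouts(prompts))  # blocks if the trainer falls behind
        rollout_queue.put(None)  # sentinel: no more data

    def train_loop(rollout_queue):
        # Consumer: train on batch t while batch t+1 is being generated.
        while (rollouts := rollout_queue.get()) is not None:
            train_step(rollouts)

    if __name__ == "__main__":
        prompt_batches = [["What is 2+2?", "Factor x^2-1."]] * 4
        q = queue.Queue(maxsize=1)  # bounded queue: generation runs at most one batch ahead
        threading.Thread(target=generation_thread, args=(prompt_batches, q), daemon=True).start()
        train_loop(q)

In this sketch the bounded queue lets generation run at most one batch ahead, which is one way to limit how stale (off-policy) the training data can get.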

Debug (Single GPU)

You can run the script in a single GPU mode to debug the training process.

bash scripts/train/debug/grpo.sh

Reproduce allenai/Llama-3.1-Tulu-3.1-8B (2 Nodes)

You can reproduce our allenai/Llama-3.1-Tulu-3.1-8B model by running the following command:

bash scripts/train/tulu3/grpo_8b.sh

[Figures: grpo_tulu3_8b, grpo_tulu3_8b_time]

👉 Tracked WandB Experiments

Info

Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

[Figure: grpo_plot]

Info

Based on our internal evaluation, the GRPO model is roughly on par with the original allenai/Llama-3.1-Tulu-3.1-8B model, though there are some slight differences. Note that your results may vary slightly due to the random seeds used in the training.

[Figure: grpo_plot]

Reproduce allenai/OLMo-2-1124-7B-Instruct but better (2 Nodes)

You can reproduce our allenai/OLMo-2-1124-7B-Instruct model by running the following command:

bash scripts/train/olmo2/grpo_7b.sh

[Figures: grpo_olmo2_7b, grpo_olmo2_7b_time]

👉 Tracked WandB Experiments

Info

Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

[Figure: grpo_plot]

Info

Based on our internal evaluation, the GRPO model actually outperforms the original allenai/OLMo-2-1124-7B-Instruct model. This is mostly because the original allenai/OLMo-2-1124-7B-Instruct was trained with PPO, which may suffer from not using an outcome reward model to initialize the value model (it uses a general RM to initialize the value model instead). Note that your results may vary slightly due to the random seeds used in the training.

[Figure: grpo_plot]

(🧪 Experimental) Qwen 2.5 7B Zero-style

Here is a command to run GRPO on Qwen/Qwen2.5-7B with ai2-adapt-dev/math_ground_truth_zs, which is simply a zero-shot version of the RLVR MATH dataset. Training starts from the base model, similar to how DeepSeek R1 does it.

bash scripts/train/qwen/grpo_7b.sh
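
For context, below is a minimal sketch of what a verifiable (RLVR-style) math reward can look like; the answer-extraction heuristic and function names are simplified assumptions, not the exact verifier used in open-instruct.

    # Sketch of a verifiable math reward: extract the model's final answer and
    # compare it with the dataset's ground truth. The extraction heuristic is a
    # simplification for illustration.
    import re

    def extract_final_answer(response: str):
        # Prefer a \boxed{...} answer if present, otherwise fall back to the last number.
        boxed = re.findall(r"\\boxed\{([^}]*)\}", response)
        if boxed:
            return boxed[-1].strip()
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        return numbers[-1] if numbers else None

    def verifiable_reward(response: str, ground_truth: str) -> float:
        answer = extract_final_answer(response)
        return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

    print(verifiable_reward("2 + 2 = 4, so the final answer is \\boxed{4}.", "4"))  # 1.0

The objective/verifiable_correct_rate metric described later is essentially the average of this kind of binary correctness signal across responses.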

[Figures: grpo_qwen2.5_7B_works, grpo_qwen2.5_7B_works_time]

👉 Tracked WandB Experiments

Info

Below are some learning curves for the evaluation metrics during training. Basically, ifeval, gsm8k, and math:flex all go up.

[Figure: grpo_plot]

Info

We haven't quite figured out how to make our internal evaluation toolchains more open yet. Stay tuned!

Training Metrics

During training, the following metrics are logged:

  • episode: the global episode number training has gone through (e.g., 3000 means we have trained on 3000 data points already -- in the case of RLVR that is prompts, which can repeat)
  • lr: the current learning rate
  • epoch: the fraction or multiple of the epoch (e.g., 2.7 means we have trained on the dataset for 2 epochs and 70% of the third epoch)
  • objective/kl: the KL divergence between the current policy and the reference policy (sum of the KL divergence of each response token)
  • objective/scores: the scores of the current response, rated by a combination of reward model and other rewards (e.g., R1 style format reward, verifiable reward, etc.)
  • objective/rlhf_reward: the RLHF reward, which is objective/scores - beta * objective/kl (see the sketch after this list)
  • objective/non_score_reward: beta * objective/kl
  • objective/entropy: the entropy of the current policy
  • objective/loss: the GRPO loss
  • objective/kl2: the second variant of KL divergence used in the training process, calculated similarly to objective/kl
  • objective/kl3: the third variant of KL divergence used in the training process, providing additional insights into policy divergence
  • objective/scores_mean: the mean of the scores of the current response, providing an average measure of response quality
  • objective/reward_std: the standard deviation of the rewards, indicating the variability in the reward distribution
  • objective/verifiable_correct_rate: the rate at which responses are verifiably correct, providing a measure of response accuracy
  • loss/policy_avg: the average policy loss, indicating the mean loss incurred during policy updates
  • policy/approxkl_avg: the average approximate KL divergence, used to monitor policy stability
  • policy/clipfrac_avg: the average fraction of updates where the policy was clipped, indicating how often clipping occurs
  • policy/entropy_avg: the average entropy of the policy, providing a measure of policy randomness
  • time/from_scratch: the time taken to train the model from scratch
  • time/training: the time taken to do one training step
  • val/sequence_lengths: the length of the sequences in the generated responses
  • val/num_stop_token_ids: the number of stop tokens in the generated responses
  • val/ratio: the mean ratio of the new policy to the old policy, used to assess policy updates
  • val/ratio_var: the variance of the ratio of the new policy to the old policy, indicating the variability in policy updates
  • val/stop_token_rate: the rate at which stop tokens appear in the responses, providing a measure of response termination
  • val/format_scores: the mean format scores, indicating the quality of response formatting (only logged if add_r1_style_format_reward is enabled)
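
To make the relationships among the objective/* quantities concrete, here is a hedged sketch assuming per-token log-probabilities from the current policy and the frozen reference policy. Tensor names are illustrative, and the kl2/kl3 formulas are an assumption (the standard alternative KL estimators), not a quote of the actual code; the kl, non_score_reward, and rlhf_reward lines follow the definitions above.

    # Sketch of how the logged objective/* quantities relate; names and the exact
    # kl2/kl3 estimators are illustrative assumptions.
    import torch

    def objective_metrics(policy_logprobs: torch.Tensor,  # [batch, response_len]
                          ref_logprobs: torch.Tensor,     # [batch, response_len]
                          scores: torch.Tensor,           # [batch] combined rewards
                          beta: float = 0.05):            # KL coefficient (illustrative value)
        logratio = policy_logprobs - ref_logprobs

        kl1 = logratio                           # per-token KL estimate behind objective/kl
        kl2 = 0.5 * logratio ** 2                # assumed form of objective/kl2
        kl3 = torch.expm1(-logratio) + logratio  # assumed form of objective/kl3

        kl = kl1.sum(dim=-1)                     # objective/kl: summed over response tokens
        non_score_reward = beta * kl             # objective/non_score_reward
        rlhf_reward = scores - beta * kl         # objective/rlhf_reward = scores - beta * kl

        return {
            "objective/kl": kl.mean().item(),
            "objective/kl2": kl2.sum(dim=-1).mean().item(),
            "objective/kl3": kl3.sum(dim=-1).mean().item(),
            "objective/scores": scores.mean().item(),
            "objective/non_score_reward": non_score_reward.mean().item(),
            "objective/rlhf_reward": rlhf_reward.mean().item(),
        }

    # Tiny usage example with fake log-probs and rewards.
    fake_policy = torch.randn(4, 16) * 0.1 - 3.0
    fake_ref = fake_policy - torch.randn(4, 16) * 0.01
    print(objective_metrics(fake_policy, fake_ref, scores=torch.ones(4)))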

Acknowledgements

We would like to thank the following resources for GRPO theory:

We would like to thank the following resources for GRPO implementation and Ray usage:

We would like to thank the following projects for general infrastructure: