OLMo-core Sharding and Parallelism

The DPO and GRPO trainers in open-instruct use OLMo-core's parallelism primitives for distributed training. This page documents their shared sharding architecture and per-algorithm differences.

Overview

OLMo-core uses Hybrid Sharded Data Parallelism (HSDP) via PyTorch's DTensor-based FSDP2. HSDP combines full sharding within a group of GPUs with replication across groups, balancing memory savings with communication efficiency.

DPO and GRPO share these defaults:

Setting	Value
Parallelism type	HSDP (`DataParallelType.hsdp`)
Wrapping strategy	`blocks` (each Transformer block is one FSDP unit)
Parameter dtype	`bfloat16`
Reduction dtype	`float32`
Activation checkpointing	Budget-mode (when `activation_memory_budget < 1.0` and `compile_model`)

How HSDP Works

In HSDP, GPUs are organized into shard groups and replica groups:

Shard group: A set of GPUs that collectively hold one full copy of the model. Parameters and gradients are sharded (split) across these GPUs, reducing per-GPU memory.
Replica group: Independent shard groups that each hold a full copy. Gradients are all-reduced across replicas after each step.

For example, with 16 GPUs, shard_degree=8, and num_replicas=2:

GPUs 0-7 form shard group 1, GPUs 8-15 form shard group 2.
Each group holds a complete model, sharded across 8 GPUs.
After the backward pass, gradients are reduced across the two groups.

When shard_degree and num_replicas are both None (the default), OLMo-core auto-detects using node boundaries: num_replicas is set to the number of nodes and shard_degree is set to the number of GPUs per node. This means all-gather/reduce-scatter traffic stays within a single node (fast NVLink), while only the lighter all-reduce crosses the network between nodes.

If only one of the two is specified, OLMo-core infers the other by dividing dp_world_size by the given value. Both values must evenly divide the data-parallel world size.

See DataParallelConfig.get_replicate_and_shard_degree() for the implementation.

Wrapping Strategy

Both trainers use the blocks wrapping strategy, which wraps each Transformer block as an individual FSDP unit. This provides a good balance between:

Communication overhead: Fewer FSDP units means fewer all-gather/reduce-scatter operations.
Memory efficiency: Each block's parameters can be gathered and freed independently.

Sharding Flags

Both DPO and GRPO expose the same HSDP sharding flags:

Flag	Description	Default
`--fsdp_shard_degree`	Number of GPUs per shard group	`None` (auto)
`--fsdp_num_replicas`	Number of replica groups	`None` (auto)

GRPO additionally validates that the sharding configuration is consistent with the total number of learner GPUs (from --num_learners_per_node). If both flags are specified, their product must equal the total learner GPU count. If only one is specified, it must evenly divide the total.

Per-Algorithm Configuration

Both DPO and GRPO create TransformerDataParallelConfig with the same structure:

dp_config = TransformerDataParallelConfig(
    name=DataParallelType.hsdp,
    num_replicas=args.fsdp_num_replicas,   # None = auto
    shard_degree=args.fsdp_shard_degree,   # None = auto
    param_dtype=...,                       # bfloat16 (DPO) or configurable (GRPO)
    reduce_dtype=DType.float32,
    wrapping_strategy=TransformerDataParallelWrappingStrategy.blocks,
)

DPO

DPO (open_instruct/dpo.py) additionally supports these parallelism flags:

Flag	Description	Default
`--tensor_parallel_degree`	Tensor parallelism degree	`1` (disabled)
`--context_parallel_degree`	Context (sequence) parallelism degree	`1` (disabled)

DPO supports tensor parallelism (splitting individual layers across GPUs), which can be combined with HSDP for very large models. Context parallelism (--context_parallel_degree) is not yet supported in DPO.

GRPO

GRPO (open_instruct/grpo_olmo_core_actor.py) differs from DPO in:

Single-GPU mode. When --single_gpu_mode is set, dp_config is None and no sharding is applied. The model is instead cast to the target dtype directly.
Ray actors. GRPO uses Ray to coordinate distributed training, so torch.distributed is initialized within each Ray actor rather than at the script level.

Comparison

DPO and GRPO share HSDP with blocks wrapping and float32 reductions. The table below shows where they differ:

Aspect	DPO	GRPO
Tensor parallelism	Yes	No
Context parallelism	No	No
Training coordinator	`torch.distributed`	Ray actors

Choosing Parallelism Settings

For most use cases, leaving fsdp_shard_degree and fsdp_num_replicas as None (auto-detect) works well. Manually tune these when:

You need to control communication patterns. For example, setting shard_degree to the number of GPUs per node ensures all-gather/reduce-scatter stays within a single node (fast NVLink), while cross-node communication is limited to all-reduce across replicas.
You're hitting OOM. Increasing shard_degree (more GPUs per shard group) reduces per-GPU memory at the cost of more communication.
You're scaling to many nodes. For large clusters, explicitly setting num_replicas can prevent OLMo-core from choosing a suboptimal layout.

For very large models (32B+), consider enabling tensor parallelism (--tensor_parallel_degree) in DPO to keep per-GPU memory manageable.

FSDP-First Loading Pattern

DPO and GRPO follow the same model initialization sequence:

Build the OLMo-core Transformer model on CPU.
Create the TrainModule, which calls parallelize_model() internally. This applies FSDP sharding and reinitializes weights from scratch.
Reload the HuggingFace checkpoint into the now-sharded model via load_hf_model().

This "FSDP-first" pattern ensures that FSDP metadata is set up before any real weights are loaded, avoiding the need to hold a full unsharded copy in memory. See DPOTrainModule in open_instruct/olmo_core_train_modules.py and PolicyTrainerOLMoCoreProcess.setup_model() in open_instruct/grpo_olmo_core_actor.py for the DPO and GRPO implementations respectively.