Commit a59fb46
[doc] feat: rewrite CPU offloading documentation with measured metrics (#3062)
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
1 parent d966544 commit a59fb46

4 files changed

Lines changed: 632 additions & 74 deletions

File tree

docs/skills-index.md

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ skills/perf-techniques/sequence-packing/SKILL
 skills/perf-techniques/hybrid-context-parallel/SKILL
 skills/perf-techniques/expert-parallel-overlap/SKILL
 skills/perf-techniques/moe-comm-overlap/SKILL
+skills/perf-techniques/cpu-offloading/SKILL
 ```
 
 ```{toctree}

docs/training/cpu-offloading.md

Lines changed: 116 additions & 74 deletions
@@ -1,76 +1,118 @@
 # CPU Offloading
 
-## Overview
-
-CPU Offloading in Megatron Bridge is a feature that reduces the peak memory usage of the GPU by offloading activations and inactive weights to CPU storage. Megatron Bridge supports offloading at the transformer layer level, allowing users to specify the number of transformer layers in their language model that require CPU offloading. During the forward pass, Megatron Bridge offloads activations at the optimal time and reloads them as needed during the backward pass.
-
-## Features
-
-- Supports training models with long sequence lengths by managing activation memory efficiently
-- Enables high batch sizes per GPU by offloading activation memory
-- Overlaps computation with data transfers (Host2Device and Device2Host) during offloading and reloading
-
-## Configuration
-
-CPU offloading is configured through the model provider parameters:
-
-```python
-from megatron.bridge.models import GPTModelProvider
-
-# Basic CPU offloading configuration
-model_config = GPTModelProvider(
-    # Model architecture
-    hidden_size=4096,
-    num_layers=32,
-
-    # CPU offloading settings
-    cpu_offloading=True,              # Enable CPU offloading
-    cpu_offloading_num_layers=16,     # Number of layers to offload (0 to num_layers-1)
-    cpu_offloading_activations=True,  # Offload activations
-    cpu_offloading_weights=True,      # Offload weights
-
-    # ... other model parameters
-)
-```
-
-### Configuration Parameters
-
-- **`cpu_offloading`**: Set to `True` to enable CPU offloading
-- **`cpu_offloading_num_layers`**: Number of transformer layers to offload (value between 0 and total number of layers minus one)
-- **`cpu_offloading_activations`**: Whether to offload activations to CPU memory (default: `True`)
-- **`cpu_offloading_weights`**: Whether to offload inactive weights to CPU memory (default: `False`)
-- **`cpu_offloading_double_buffering`**: Enable double buffering across layers while reloading activations from CPU (default: `False`)
-
-### Offloading Strategies
-
-You can configure different combinations of offloading based on your memory requirements:
-
-#### Activations Only
-```python
-model_config = GPTModelProvider(
-    cpu_offloading=True,
-    cpu_offloading_num_layers=8,
-    cpu_offloading_activations=True,  # Offload activations
-    cpu_offloading_weights=False,     # Keep weights on GPU
-)
-```
-
-#### Weights Only
-```python
-model_config = GPTModelProvider(
-    cpu_offloading=True,
-    cpu_offloading_num_layers=8,
-    cpu_offloading_activations=False, # Keep activations on GPU
-    cpu_offloading_weights=True,      # Offload weights
-)
-```
-
-#### Both Activations and Weights
-```python
-model_config = GPTModelProvider(
-    cpu_offloading=True,
-    cpu_offloading_num_layers=8,
-    cpu_offloading_activations=True,  # Offload activations
-    cpu_offloading_weights=True,      # Offload weights
-)
-```
+CPU offloading reduces per-GPU memory by moving data to host (CPU) memory
+during training, trading throughput for the ability to train models or
+configurations that would otherwise not fit in GPU memory.
+
+For operational setup, code anchors, and verification commands, see
+[skills/perf-techniques/cpu-offloading/SKILL.md](../skills/perf-techniques/cpu-offloading/SKILL.md).
+
+## What It Is
+
+Megatron Bridge supports two independent CPU offloading mechanisms:
+
+| Mechanism | What gets offloaded | Implementation |
+|---|---|---|
+| **Activation offloading** | Activations (and optionally weights) per transformer layer | MCore `cpu_offloading_context` in transformer block |
+| **Optimizer offloading** | Optimizer states (Adam momentum + variance) | MCore `HybridDeviceOptimizer` with configurable GPU/CPU split |
+
+Activation offloading moves layer activations to CPU during the forward pass
+and reloads them during the backward pass. Optimizer offloading keeps a
+configurable fraction of Adam optimizer states on CPU and runs the optimizer
+step there.
+
+These are independent features that address different memory pools. They can
+be used separately, but not always together, due to conflicting constraints.
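As a rough sketch of the first mechanism, the forward/backward movement can be modeled with plain Python lists standing in for GPU and CPU buffers; `CpuOffloadStore`, the doubling "layers", and the function names are hypothetical, not the MCore implementation:

```python
# Illustrative sketch of layer-level activation offloading (NOT the MCore
# implementation): activations saved during the forward pass are copied to a
# CPU-side store and dropped on the "GPU", then reloaded for the backward pass.

class CpuOffloadStore:
    def __init__(self):
        self._cpu = {}  # layer index -> offloaded activation

    def offload(self, layer_idx, activation):
        self._cpu[layer_idx] = list(activation)  # simulated D2H copy

    def reload(self, layer_idx):
        return self._cpu.pop(layer_idx)          # simulated H2D copy

def forward(layer_input, num_layers, num_offload_layers, store):
    gpu_resident = {}
    act = layer_input
    for i in range(num_layers):
        act = [x * 2 for x in act]               # stand-in for layer compute
        if i < num_offload_layers:
            store.offload(i, act)                # saved-for-backward copy to CPU
        else:
            gpu_resident[i] = act                # stays in GPU memory
    return act, gpu_resident

def backward(num_layers, num_offload_layers, gpu_resident, store):
    visited = []
    for i in reversed(range(num_layers)):
        act = store.reload(i) if i < num_offload_layers else gpu_resident[i]
        visited.append((i, act))
    return visited

store = CpuOffloadStore()
out, resident = forward([1.0, 2.0], num_layers=4, num_offload_layers=2, store=store)
order = [i for i, _ in backward(4, 2, resident, store)]
print(out)    # [16.0, 32.0]
print(order)  # [3, 2, 1, 0]
```

Only the first `num_offload_layers` layers pay the transfer cost, which is why the real feature exposes a per-layer count rather than a global switch.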
+
+## What Problem It Solves
+
+Large models, especially MoE architectures, can exhaust GPU memory even with
+standard parallelism techniques (TP, PP, EP). The two offloading mechanisms
+target different bottlenecks:
+
+- **Activation offloading** helps when activation memory dominates — common
+  with long sequences, large batch sizes, or when recomputation is disabled.
+- **Optimizer offloading** helps when optimizer state memory dominates — Adam
+  keeps two state tensors (momentum + variance) per parameter, doubling the
+  parameter memory footprint. For a 30B MoE model this can be 15+ GB per GPU.
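The arithmetic behind that estimate can be written out. This is a back-of-envelope sketch assuming fp32 Adam states; `shard_factor` is a hypothetical stand-in for however the distributed optimizer shards state across ranks:

```python
# Back-of-envelope Adam optimizer-state memory (illustrative only; real numbers
# depend on precision settings and how the distributed optimizer shards state).

def adam_state_gib(num_params, bytes_per_state=4, shard_factor=1):
    """Adam keeps momentum + variance: two state tensors per parameter."""
    total_bytes = num_params * 2 * bytes_per_state
    return total_bytes / shard_factor / 2**30

# 30B parameters, fp32 states, unsharded: ~224 GiB across the whole model.
print(round(adam_state_gib(30e9), 1))                  # 223.5
# Sharded 16 ways: ~14 GiB per GPU, the same order as the 15+ GB figure above.
print(round(adam_state_gib(30e9, shard_factor=16), 1)) # 14.0
```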
+
+## Impacted Training Dimensions
+
+| Dimension | Effect | Confidence | Rationale |
+|-----------|--------|------------|-----------|
+| Speed | 1.9x–4.2x slower step time (scales linearly with offload fraction) | high | CPU Adam compute and D2H/H2D transfers add latency. Measured on Qwen3-30B-A3B TP2 PP2 EP4. D2H/H2D overlap reduces the 100%-offload penalty from 4.2x to 3.9x. |
+| Memory | 3.8 GB saved per 25% of optimizer offload fraction (up to 15.3 GB / 32% at 100%) | high | Measured on Qwen3-30B-A3B (47.2 GB baseline). Activation offload savings are proportional to the number of layers offloaded. |
+| Scale | Enables otherwise-OOM configurations | medium | Can free memory for larger batch sizes or additional parallelism. |
+| Convergence | No change (loss delta < 0.001 across all fractions) | high | All optimizer offload fractions (25–100%) produce identical loss across 20 iterations. |
+| Stability | No issues observed | high | No errors, hangs, or NCCL issues across 120 total iterations tested (6 configurations). |
+
+D2H (device-to-host) and H2D (host-to-device) refer to data transfers between
+GPU and CPU memory. Each optimizer step copies gradients to CPU (D2H), runs
+Adam on CPU, then copies the updated parameters back (H2D). The
+`overlap_cpu_optimizer_d2h_h2d` flag overlaps these transfers with compute.
+On Qwen3-30B-A3B MoE this provided only ~7% speedup because CPU-side Adam
+compute — not the transfers — was the dominant bottleneck. Other models with
+different parameter counts or optimizer configurations may see different
+transfer-to-compute ratios.
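The per-step pipeline can be sketched in plain Python. `hybrid_step` and `adam_update` are hypothetical stand-ins for MCore's `HybridDeviceOptimizer` (lists stand in for tensors, and the D2H/H2D copies are marked only in comments):

```python
import math

# Illustrative fraction-split CPU Adam step (NOT the MCore implementation):
# the first `offload_fraction` of parameters keeps its Adam state on "CPU";
# each step does D2H (gradients), Adam on CPU, then H2D (updated parameters).

def adam_update(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return p - lr * m / (math.sqrt(v) + eps), m, v

def hybrid_step(params, grads, state, offload_fraction):
    split = int(len(params) * offload_fraction)
    for i in range(len(params)):
        if i < split:
            g = grads[i]                                   # D2H: grad to host
            p, m, v = adam_update(params[i], g, *state[i]) # Adam on CPU
            params[i] = p                                  # H2D: param back
        else:
            p, m, v = adam_update(params[i], grads[i], *state[i])  # on GPU
            params[i] = p
        state[i] = (m, v)

a = [1.0, 2.0, 3.0, 4.0]
b = list(a)
hybrid_step(a, [0.1] * 4, [(0.0, 0.0)] * 4, offload_fraction=0.0)
hybrid_step(b, [0.1] * 4, [(0.0, 0.0)] * 4, offload_fraction=1.0)
print(a == b)  # True: placement does not change the math
```

Because the same Adam math runs regardless of placement, the updated parameters are identical for any offload fraction, which is consistent with the measured no-change in convergence.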
+
+## When to Use It
+
+- GPU memory is tight and throughput regression is acceptable
+- The model requires PP > 1 to fit — use **optimizer offloading** (activation
+  offloading requires PP=1)
+- You want a tunable memory-speed tradeoff via `optimizer_offload_fraction`
+- Activation memory is the bottleneck and the model fits with PP=1 and no
+  recompute — use **activation offloading**
+
+## When Not to Use It
+
+- Throughput is the primary concern — offloading always adds overhead
+- The model already fits comfortably in GPU memory
+- CUDA graphs are enabled — activation offloading is incompatible
+- The model is large (30B+ MoE) and requires PP > 1 — activation offloading
+  is blocked by the PP=1 constraint
+- Alternative memory techniques (FSDP, activation recomputation) provide
+  sufficient savings without the throughput penalty
+
+## Feature Interactions
+
+| Feature | Interaction | Details |
+|---------|-------------|---------|
+| Pipeline parallelism (PP > 1) | **Blocks** activation offloading | Hard MCore constraint. Use optimizer offloading instead. |
+| Activation recomputation | **Blocks** activation offloading | Hard MCore constraint. Cannot combine. |
+| CUDA graphs | **Blocks** activation offloading | Hard MCore constraint. Optimizer offloading is unaffected. |
+| Fine-grained activation offloading | **Mutual exclusion** with layer-level activation offloading | Use one or the other. Fine-grained works with PP > 1. |
+| Distributed optimizer | **Required** for optimizer offloading | `use_distributed_optimizer=True` (default in most recipes). |
+| Megatron FSDP | Alternative | Shards parameters across DP ranks. Different tradeoff profile. |
+| Expert parallelism | Compatible | Both offloading mechanisms work with EP. |
+
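The hard constraints above can be condensed into a pre-flight check. The helper below is hypothetical (not a Megatron Bridge API); it only encodes the interaction table:

```python
# Hypothetical pre-flight check encoding the interaction table above
# (not a Megatron Bridge API).

def check_offload_config(pp_size=1, recompute=False, cuda_graphs=False,
                         activation_offload=False, fine_grained_offload=False,
                         optimizer_offload=False, distributed_optimizer=True):
    errors = []
    if activation_offload and pp_size > 1:
        errors.append("activation offloading requires PP=1")
    if activation_offload and recompute:
        errors.append("activation offloading cannot combine with recomputation")
    if activation_offload and cuda_graphs:
        errors.append("activation offloading is incompatible with CUDA graphs")
    if activation_offload and fine_grained_offload:
        errors.append("layer-level and fine-grained activation offloading "
                      "are mutually exclusive")
    if optimizer_offload and not distributed_optimizer:
        errors.append("optimizer offloading requires use_distributed_optimizer")
    return errors

# A 30B MoE run with PP=2 cannot use activation offloading...
print(check_offload_config(pp_size=2, activation_offload=True))
# ...but optimizer offloading is unaffected by PP.
print(check_offload_config(pp_size=2, optimizer_offload=True))  # []
```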
+## Bridge Configuration
+
+CPU offloading is configured through two independent config namespaces:
+
+- **Optimizer offloading**: `optimizer.optimizer_cpu_offload`,
+  `optimizer.optimizer_offload_fraction`, and
+  `optimizer.overlap_cpu_optimizer_d2h_h2d`
+- **Activation offloading**: `model.cpu_offloading`,
+  `model.cpu_offloading_num_layers`, and related `model.cpu_offloading_*` fields
+
+For config examples, parameter tables, and runnable commands, see
+[skills/perf-techniques/cpu-offloading/SKILL.md](../skills/perf-techniques/cpu-offloading/SKILL.md).
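As a quick orientation, the two namespaces can be sketched as dotted-key overrides. Only the field names come from this page; the override-dict form and the example values are hypothetical, so defer to SKILL.md for real recipes:

```python
# Illustrative override sketch (hypothetical form; field names as documented
# above). Optimizer and activation offloading are configured independently.

overrides = {
    # Optimizer offloading: keep 50% of Adam state on CPU, overlap transfers.
    "optimizer.optimizer_cpu_offload": True,
    "optimizer.optimizer_offload_fraction": 0.5,
    "optimizer.overlap_cpu_optimizer_d2h_h2d": True,
    # Activation offloading (PP=1 only): offload activations for 8 layers.
    "model.cpu_offloading": True,
    "model.cpu_offloading_num_layers": 8,
}
```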
+
+## Common Failure Modes
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| `Currently there is no support for Pipeline parallelism with CPU offloading` | Activation offload with PP > 1 | Set PP=1 or switch to optimizer offloading |
+| `CPU offloading does not work when activation recomputation is enabled` | Activation offload with recompute enabled | Set `recompute_granularity=null` |
+| `CUDA graphs not supported with CPU offloading` | Activation offload with CUDA graphs | Set `cuda_graph_impl="none"` |
+| `fine_grained_activation_offloading cannot be enabled with cpu_offloading` | Both offloading types enabled | Use one or the other |
+| OOM with activation offloading on a large model | Model too large for PP=1 | Switch to optimizer offloading (works with PP > 1) |
+| >4x throughput regression | 100% optimizer offload; CPU Adam bottleneck | Reduce the fraction or enable `overlap_cpu_optimizer_d2h_h2d` |
+
+## Related Docs
+
+- [docs/training/activation-recomputation.md](activation-recomputation.md)
+- [docs/training/megatron-fsdp.md](megatron-fsdp.md)
+- [docs/training/optimizer-scheduler.md](optimizer-scheduler.md)
+- [docs/training/cuda-graphs.md](cuda-graphs.md)
+- [skills/perf-techniques/cpu-offloading/SKILL.md](../skills/perf-techniques/cpu-offloading/SKILL.md)
