# CPU Offloading

CPU offloading reduces per-GPU memory by moving data to host (CPU) memory
during training, trading throughput for the ability to train models or
configurations that would otherwise not fit in GPU memory.

For operational setup, code anchors, and verification commands, see
[skills/perf-techniques/cpu-offloading/SKILL.md](../skills/perf-techniques/cpu-offloading/SKILL.md).

## What It Is

Megatron Bridge supports two independent CPU offloading mechanisms:

| Mechanism | What gets offloaded | Implementation |
|---|---|---|
| **Activation offloading** | Activations (and optionally weights) per transformer layer | MCore `cpu_offloading_context` in transformer block |
| **Optimizer offloading** | Optimizer states (Adam momentum + variance) | MCore `HybridDeviceOptimizer` with configurable GPU/CPU split |

Activation offloading moves layer activations to CPU during the forward pass
and reloads them during the backward pass. Optimizer offloading keeps a
configurable fraction of Adam optimizer states on CPU and runs the optimizer
step there.

The two mechanisms are independent and target different memory pools. They can
be enabled separately, though not every combination is possible; see Feature
Interactions below for the constraints.

## What Problem It Solves

Large models, especially MoE architectures, can exhaust GPU memory even with
standard parallelism techniques (TP, PP, EP). The two offloading mechanisms
target different bottlenecks:

- **Activation offloading** helps when activation memory dominates — common
  with long sequences, large batch sizes, or when recomputation is disabled.
- **Optimizer offloading** helps when optimizer state memory dominates — Adam
  keeps two state tensors (momentum + variance) per parameter, doubling the
  parameter memory footprint. For a 30B MoE model this can be 15+ GB per GPU.

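To see why Adam state can dominate, the footprint is easy to estimate from the
parameter count a rank owns. The sketch below is a back-of-the-envelope helper:
the 2B local parameter count is a made-up example, and real footprints depend
on state precision and distributed-optimizer sharding.

```python
def adam_state_bytes(local_params: int, bytes_per_state: int = 4) -> int:
    """Estimate Adam optimizer state memory for the parameters owned by one rank.

    Adam keeps two state tensors (momentum and variance) per parameter;
    with fp32 states that is 8 bytes per parameter.
    """
    return local_params * 2 * bytes_per_state

# Hypothetical example: 2B parameters owned locally after sharding.
local_params = 2_000_000_000
gb = adam_state_bytes(local_params) / 1e9
print(f"{gb:.1f} GB of Adam state")  # prints "16.0 GB of Adam state"
```

Offloading a fraction of that state to CPU frees roughly that fraction of the
estimate, which is the knob the optimizer offload exposes.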
## Impacted Training Dimensions

| Dimension | Effect | Confidence | Rationale |
|-----------|--------|------------|-----------|
| Speed | 1.9x–4.2x slower step time (scales linearly with offload fraction) | high | CPU Adam compute and D2H/H2D transfers add latency. Measured on Qwen3-30B-A3B TP2 PP2 EP4. D2H/H2D overlap reduces the 100%-offload penalty from 4.2x to 3.9x. |
| Memory | 3.8 GB saved per 25% of optimizer offload fraction (up to 15.3 GB, i.e. 32%, at 100%) | high | Measured on Qwen3-30B-A3B (47.2 GB baseline). Activation offload savings scale with the number of layers offloaded. |
| Scale | enables otherwise-OOM configurations | medium | Can free memory for larger batch sizes or additional parallelism. |
| Convergence | no change (loss delta < 0.001 across all fractions) | high | All optimizer offload fractions (25–100%) produce identical loss across 20 iterations. |
| Stability | no issues observed | high | No errors, hangs, or NCCL issues across 120 total iterations tested (6 configurations). |

D2H (device-to-host) and H2D (host-to-device) refer to data transfers between
GPU and CPU memory. Each optimizer step copies gradients to CPU (D2H), runs
Adam on CPU, then copies updated parameters back (H2D). The
`overlap_cpu_optimizer_d2h_h2d` flag overlaps these transfers with compute.
On Qwen3-30B-A3B MoE this provided only ~7% speedup because CPU-side Adam
compute — not the transfers — was the dominant bottleneck. Other models with
different parameter counts or optimizer configurations may see different
transfer-to-compute ratios.

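Given the roughly linear memory-vs-fraction relationship above, choosing the
offload fraction amounts to picking the smallest 25% step that frees enough
memory. The helper below is illustrative only: it hard-codes the Qwen3-30B-A3B
slope (~3.8 GB per 25%), and other models will have different slopes.

```python
def min_offload_fraction(needed_gb: float, gb_per_quarter: float = 3.8) -> float:
    """Smallest optimizer offload fraction (in 25% steps) freeing >= needed_gb.

    gb_per_quarter is the measured memory saved per 25% of offload fraction;
    the 3.8 GB default is the Qwen3-30B-A3B figure quoted above.
    """
    for quarters in range(1, 5):
        if quarters * gb_per_quarter >= needed_gb:
            return quarters * 0.25
    return 1.0  # even full offload may not free enough; revisit parallelism

print(min_offload_fraction(7.0))  # prints 0.5 (two 3.8 GB steps free 7.6 GB)
```

Since step time also scales with the fraction, picking the minimum fraction
that fits keeps the throughput penalty as small as possible.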
## When to Use It

- GPU memory is tight and a throughput regression is acceptable
- The model requires PP > 1 to fit — use **optimizer offloading** (activation
  offloading requires PP=1)
- You want a tunable memory-speed tradeoff via `optimizer_offload_fraction`
- Activation memory is the bottleneck and the model fits with PP=1 and no
  recompute — use **activation offloading**

## When Not to Use It

- Throughput is the primary concern — offloading always adds overhead
- The model already fits comfortably in GPU memory
- CUDA graphs are enabled — activation offloading is incompatible
- The model is large (30B+ MoE) and requires PP > 1 — activation offloading
  is blocked by the PP=1 constraint
- Alternative memory techniques (FSDP, activation recomputation) provide
  sufficient savings without the throughput penalty

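The guidance in the two lists above can be condensed into a small decision
rule. This is an illustrative summary, not an API; MCore enforces the hard
constraints at startup regardless of what you pick here.

```python
def pick_offloading(pp_size: int, uses_cuda_graphs: bool, uses_recompute: bool,
                    activation_bound: bool) -> str:
    """Suggest an offloading mechanism from the constraints described above.

    activation_bound: True if activation memory (not optimizer state) is
    the dominant memory consumer.
    """
    # Activation offloading has hard prerequisites: PP=1, no CUDA graphs,
    # and no activation recomputation.
    activation_ok = pp_size == 1 and not uses_cuda_graphs and not uses_recompute
    if activation_bound and activation_ok:
        return "activation offloading"
    return "optimizer offloading"

print(pick_offloading(pp_size=2, uses_cuda_graphs=False,
                      uses_recompute=False, activation_bound=True))
# prints "optimizer offloading" (PP > 1 blocks activation offloading)
```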
## Feature Interactions

| Feature | Interaction | Details |
|---------|-------------|---------|
| Pipeline parallelism (PP > 1) | **Blocks** activation offloading | Hard MCore constraint. Use optimizer offloading instead. |
| Activation recomputation | **Blocks** activation offloading | Hard MCore constraint. Cannot combine. |
| CUDA graphs | **Blocks** activation offloading | Hard MCore constraint. Optimizer offloading is unaffected. |
| Fine-grained activation offloading | **Mutual exclusion** with layer-level activation offloading | Use one or the other. Fine-grained works with PP > 1. |
| Distributed optimizer | **Required** for optimizer offloading | `use_distributed_optimizer=True` (default in most recipes). |
| Megatron FSDP | Alternative | Shards parameters across DP ranks. Different tradeoff profile. |
| Expert parallelism | Compatible | Both offloading mechanisms work with EP. |

## Bridge Configuration

CPU offloading is configured through two independent config namespaces:

- **Optimizer offloading**: `optimizer.optimizer_cpu_offload`,
  `optimizer.optimizer_offload_fraction`, and
  `optimizer.overlap_cpu_optimizer_d2h_h2d`
- **Activation offloading**: `model.cpu_offloading`,
  `model.cpu_offloading_num_layers`, and related `model.cpu_offloading_*` fields

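As a sketch of how the two namespaces fit together, the snippet below uses
`types.SimpleNamespace` as a stand-in for the actual Bridge config objects.
The field names come from the lists above; the values (and the choice to
enable only optimizer offloading) are illustrative.

```python
from types import SimpleNamespace

# Stand-ins for the real Bridge config objects; values are illustrative.
optimizer = SimpleNamespace(
    use_distributed_optimizer=True,      # prerequisite for optimizer offloading
    optimizer_cpu_offload=True,          # enable CPU optimizer offloading
    optimizer_offload_fraction=0.5,      # keep half of the Adam states on CPU
    overlap_cpu_optimizer_d2h_h2d=True,  # overlap D2H/H2D transfers with compute
)

model = SimpleNamespace(
    cpu_offloading=False,         # activation offloading left off here, since it
    cpu_offloading_num_layers=0,  # would need PP=1, no recompute, no CUDA graphs
)
```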
For config examples, parameter tables, and runnable commands, see
[skills/perf-techniques/cpu-offloading/SKILL.md](../skills/perf-techniques/cpu-offloading/SKILL.md).

## Common Failure Modes

| Symptom | Cause | Fix |
|---------|-------|-----|
| `Currently there is no support for Pipeline parallelism with CPU offloading` | Activation offload with PP > 1 | Set PP=1 or switch to optimizer offloading |
| `CPU offloading does not work when activation recomputation is enabled` | Activation offload with recompute enabled | Set `recompute_granularity=null` |
| `CUDA graphs not supported with CPU offloading` | Activation offload with CUDA graphs | Set `cuda_graph_impl="none"` |
| `fine_grained_activation_offloading cannot be enabled with cpu_offloading` | Both offloading types enabled | Use one or the other |
| OOM with activation offloading on large model | Model too large for PP=1 | Switch to optimizer offloading (works with PP > 1) |
| >4x throughput regression | 100% optimizer offload, CPU Adam bottleneck | Reduce fraction or enable `overlap_cpu_optimizer_d2h_h2d` |

## Related Docs

- [docs/training/activation-recomputation.md](activation-recomputation.md)
- [docs/training/megatron-fsdp.md](megatron-fsdp.md)
- [docs/training/optimizer-scheduler.md](optimizer-scheduler.md)
- [docs/training/cuda-graphs.md](cuda-graphs.md)
- [skills/perf-techniques/cpu-offloading/SKILL.md](../skills/perf-techniques/cpu-offloading/SKILL.md)