Commit 9085707 — TimDettmers and Claude

docs: Add CPU→GPU weight streaming analysis and benchmark

Theoretical model and empirical validation of layer-by-layer weight streaming for QLoRA training, enabling 70B+ models on a single consumer GPU by keeping frozen base weights in CPU DRAM or NVMe. Includes:

- Theoretical bandwidth model (PCIe, NVMe, three-tier pipeline)
- Configuration grids for Llama-70B and GLM-4.7 (355B MoE)
- stream_bench.py: benchmark measuring actual PCIe overlap, matmul throughput, and double-buffered pipeline overhead
- Analysis of 2/3/4-bit quantization impact on MoE streaming

Key finding: <0.5% pipeline overhead once per-layer compute exceeds PCIe transfer time (~4K tokens on RTX 4090 + PCIe 3.0 for Llama-70B).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2 files changed: +833, -0 lines
docs/streaming_analysis/README.md

# CPU→GPU Weight Streaming for QLoRA Training

This document analyzes the feasibility of streaming frozen base weights from CPU DRAM (or NVMe) to the GPU during QLoRA training, eliminating the need to hold the full model in VRAM. It covers a theoretical bandwidth model, empirical validation on an RTX 4090, and configuration grids for several hardware setups.
## Key Result

**Weight streaming works with near-zero overhead** once per-layer compute time exceeds the PCIe transfer time. On an RTX 4090 with PCIe 3.0 x16, the crossover is approximately 4K tokens per step; above it, the double-buffered pipeline hides 100% of the transfer latency, with <0.5% measured overhead.
## Architecture

QLoRA freezes the base model weights and trains only low-rank adapters. The frozen weights are read-only during both the forward and backward passes, making them ideal candidates for streaming from slower storage tiers.

### Three-tier pipeline

```
NVMe SSD ──(3.5-14 GB/s)──> CPU DRAM ──(11-50 GB/s PCIe)──> GPU VRAM
cold storage                bandwidth buffer                compute
```

The GPU maintains a **double buffer** (2 layer slots). While the GPU computes on one layer, the next layer transfers asynchronously from CPU DRAM via PCIe DMA on a dedicated CUDA stream. The CPU DRAM buffer decouples the NVMe and GPU rates.

### Double-buffer operation

```
Time ──────────────────────────────────────────────────────>
GPU:       [compute L0] [compute L1] [compute L2] [compute L3]
PCIe:      [xfer L1   ] [xfer L2   ] [xfer L3   ] [xfer L4   ]
NVMe→CPU:  [read L2   ] [read L3   ] [read L4   ]
```

Each layer occupies one slot. After the GPU finishes computing a layer, that slot is freed for the next incoming transfer. No GPU idle time occurs as long as `compute_time >= transfer_time`.
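The schedule above can be checked with a small event-driven sketch (pure Python, no GPU required; `compute_ms` and `transfer_ms` are taken as given inputs):

```python
def pipeline_finish_ms(n_layers, compute_ms, transfer_ms, slots=2):
    """Finish time of a double-buffered layer pipeline.

    A transfer may start once the PCIe link is free and a ring-buffer
    slot has been released by an earlier compute; compute on layer i
    starts once its transfer and the previous compute are both done.
    """
    xfer_done = [0.0] * n_layers   # when each layer's H2D copy completes
    comp_done = [0.0] * n_layers   # when each layer's compute completes
    link_free = 0.0                # PCIe link is busy until this time
    for i in range(n_layers):
        slot_free = comp_done[i - slots] if i >= slots else 0.0
        xfer_done[i] = max(link_free, slot_free) + transfer_ms
        link_free = xfer_done[i]
        prev = comp_done[i - 1] if i > 0 else 0.0
        comp_done[i] = max(xfer_done[i], prev) + compute_ms
    return comp_done[-1]

# Compute-bound (Llama-70B @ 4K tokens, measured numbers): only the first
# transfer is exposed; every later one hides behind compute.
total = pipeline_finish_ms(20, compute_ms=50.0, transfer_ms=43.0)  # 43 + 20*50
# Transfer-bound (512 tokens): the PCIe link serializes everything.
bound = pipeline_finish_ms(20, compute_ms=6.4, transfer_ms=42.0)   # 20*42 + 6.4
```

In the compute-bound case only the pipeline-fill transfer (one `transfer_ms`) is ever exposed, which is why the measured overhead amortizes to <0.5% over a full step.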
## Theoretical Model

### Per-layer timing

For a transformer layer with `P_active` active parameters:

```
compute_ms  = tokens × 3 × 2 × P_active / GPU_FLOPS × 1000
                      ↑   ↑
                      │   └─ 2 FLOPs per multiply-accumulate
                      └───── 3× for training (forward + backward ≈ 3× forward)

transfer_ms = layer_size_bytes / pcie_bandwidth × 1000
```

For MoE models, `P_active` is the active subset (routed experts + attention + shared expert), while `layer_size_bytes` includes **all** experts, since the routing decision is token-dependent.

### GPU ring buffer sizing

```
K_ring = max(2, ceil(transfer_ms / compute_ms) + 1)
```

The ring buffer holds `K_ring` layers. The GPU processes `K_ring - 1` layers while one layer transfers. When `compute_ms > transfer_ms`, `K_ring = 2` (a double buffer) suffices.

### NVMe CPU buffer sizing

The CPU DRAM buffer must absorb the rate mismatch between NVMe reads and GPU consumption. Over the total processing time, NVMe delivers:

```
nvme_delivered    = n_layers × layer_cycle_ms / nvme_ms
cpu_buffer_layers = n_layers - K_ring - nvme_delivered
```

where `layer_cycle_ms = max(compute_ms, transfer_ms)`.

### Pipeline throughput

```
step_time = n_layers × max(compute_ms, transfer_ms, nvme_ms)
```

The slowest leg (compute, PCIe, or NVMe) determines throughput. Buffering shifts when the bottleneck hits, but doesn't change the steady-state rate.
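The model above translates directly into code. A minimal sketch, plugged with the theoretical RTX 4090 + PCIe Gen3 numbers from the table below (330 TFLOPS, 13 GB/s) and Llama-70B's per-layer figures (856M params, 470 MB at NF4, 80 layers):

```python
import math

def compute_ms(tokens, p_active, gpu_flops):
    # tokens × 3 (training) × 2 (FLOPs per MAC) × active params
    return tokens * 3 * 2 * p_active / gpu_flops * 1000

def transfer_ms(layer_bytes, pcie_bytes_per_s):
    return layer_bytes / pcie_bytes_per_s * 1000

def k_ring(transfer, compute):
    # number of GPU layer slots needed so transfers stay ahead of compute
    return max(2, math.ceil(transfer / compute) + 1)

def step_time_ms(n_layers, compute, transfer, nvme=0.0):
    # the slowest leg sets the steady-state rate
    return n_layers * max(compute, transfer, nvme)

c = compute_ms(4096, 856e6, 330e12)   # ≈ 64 ms per layer
t = transfer_ms(470e6, 13e9)          # ≈ 36 ms per layer
k = k_ring(t, c)                      # 2 → a double buffer suffices
step = step_time_ms(80, c, t)         # ≈ 5.1 s per optimizer step
```

With measured rather than theoretical rates the same functions reproduce the transfer-bound regimes: at 512 tokens, `k_ring` grows well beyond 2 and the step time is set by PCIe.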
## Measured Hardware Parameters (RTX 4090 + PCIe 3.0)

| Parameter | Theoretical | Measured |
|---|---|---|
| PCIe Gen3 x16 H2D (pinned) | 13 GB/s | **11 GB/s** (85% eff.) |
| PCIe Gen3 x16 H2D (pageable) | - | **7 GB/s** |
| GPU FP16 tensor throughput | 330 TFLOPS | **160 TFLOPS** |
| Transfer/layer (470 MB, Llama-70B) | 36 ms | **43 ms** |
| Compute/layer @ 4K tokens | 61 ms | **50 ms** |
| Compute/layer @ 8K tokens | 122 ms | **100 ms** |
| Pipeline overhead (compute > transfer) | 0% | **<0.5%** |

The GPU achieves ~160 TFLOPS on these matmul shapes (not the 330 TFLOPS peak, which requires ideal tile sizes). PCIe runs at 85% of theoretical bandwidth due to protocol overhead. Both deviations are consistent and predictable.
## Benchmark Results

### Pipeline overhead vs. batch size (Llama-70B, 470 MB/layer)

| Tokens | Compute/layer | Transfer/layer | Overhead | Verdict |
|--------|---------------|----------------|----------|---------|
| 512 | 6.4 ms | 42 ms | +536% | PCIe-limited |
| 1024 | 12.6 ms | 43 ms | +224% | PCIe-limited |
| 2048 | 24.5 ms | 43 ms | +67% | PCIe-limited |
| **4096** | **50 ms** | **43 ms** | **+0.2%** | **Fully hidden** |
| 8192 | 100 ms | 42 ms | +0.4% | Fully hidden |

The crossover is sharp: below ~4K tokens the GPU idles waiting for PCIe; above it, transfers are completely hidden behind compute.
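The crossover follows from the measured numbers in the table: compute scales linearly with tokens while the transfer is a fixed ~43 ms. A quick back-of-the-envelope check using only the table's values:

```python
# Measured: 50 ms of compute at 4096 tokens → per-token compute cost.
us_per_token = 50.0 / 4096 * 1000       # ≈ 12.2 µs of compute per token per layer
xfer_ms = 43.0                          # measured per-layer transfer

# Tokens at which per-layer compute first covers the transfer:
crossover_tokens = xfer_ms * 1000 / us_per_token
print(round(crossover_tokens))          # ≈ 3500, i.e. the "~4K tokens" crossover
```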
### Memory savings

For the pipeline test with 20 × 470 MB layers (9.2 GB total weights):

- GPU ring buffer (2 layers): **0.92 GB**
- GPU peak memory: **3.55 GB** (ring + activations + compute buffers)
- VRAM savings: **90%**

Extrapolated to full Llama-70B (80 layers, 38 GB total):

- GPU ring buffer: **0.94 GB** (2 layers)
- Remaining 78 layers: in CPU DRAM or NVMe
- VRAM savings: **97.5%**
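The extrapolated figure is straightforward arithmetic over the 470 MB/layer NF4 size used above:

```python
layer_gb = 0.470
n_layers = 80

ring_gb = 2 * layer_gb              # double buffer: 0.94 GB resident
total_gb = n_layers * layer_gb      # ≈ 37.6 GB of frozen weights
savings = 1 - ring_gb / total_gb    # fraction of weight VRAM avoided
print(f"{ring_gb:.2f} GB resident, {savings:.1%} saved")  # 0.94 GB resident, 97.5% saved
```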
## Configuration Grids

### Llama-70B (dense, 80 layers, 856M params/layer, 38 GB NF4)

Minimum feasible batch size per hardware configuration:

| NVMe config | PCIe Gen3 x16 | Gen4 x16 | Gen5 x16 |
|---|---|---|---|
| 1× Gen3 (3.5 GB/s) | 1K (11s/step) | 4K (11s) | 4K (11s) |
| 2× Gen3 (7 GB/s) | 1K (5s) | 1K (5s) | 2K (5s) |
| 1× Gen4 (7 GB/s) | 1K (5s) | 1K (5s) | 2K (5s) |
| 2× Gen5 (28 GB/s) | n/a | n/a | 1K (1s) |

Dense models are straightforward — 100% of transferred weights contribute to compute, so even slow NVMe works at small batch sizes.

### GLM-4.7 (355B MoE, 92 layers, 4.1B params/layer, 207 GB NF4)

The MoE architecture creates a poor weight-to-compute ratio: each layer transfers 2.25 GB (all 160 experts), but only 12.5% of the parameters (8 active experts + attention + shared expert) contribute FLOPs. This makes the model significantly harder to stream.

**With 32 GB system RAM:**

| NVMe config | Gen4 x16 | Gen5 x16 |
|---|---|---|
| 1× Gen4 (7 GB/s) | 32K (30s) | 32K (30s) |
| 2× Gen4 (14 GB/s) | 16K (15s) | 16K (15s) |
| 2× Gen5 (28 GB/s) | n/a | **8K (7s)** |
| 4× Gen4 (28 GB/s) | n/a | **8K (7s)** |

**With 128 GB system RAM** (the larger CPU buffer absorbs the NVMe rate mismatch):

| NVMe config | Gen4 x16 | Gen5 x16 |
|---|---|---|
| 2× Gen4 (14 GB/s) | 2K (15s) | 8K (15s) |
| 2× Gen5 (28 GB/s) | n/a | **1K (7s)** |
| 4× Gen4 (28 GB/s) | n/a | **1K (7s)** |
### Effect of lower-bit quantization on GLM-4.7

Lower-bit quantization reduces transfer size without changing compute (weights are dequantized to FP16 before the matmul):

| Quantization | Layer size | Model size | Transfer/layer | Min tokens (0% overhead) |
|---|---|---|---|---|
| NF4 (4-bit) | 2.25 GB | 207 GB | 205 ms | ~11K |
| NF3 (3-bit) | 1.64 GB | 151 GB | 149 ms | ~8K |
| NF2 (2-bit) | 1.15 GB | 106 GB | 104 ms | ~5K |

At NF2, the 355B MoE model's per-layer transfer time approaches that of a 70B dense model at NF4, making streaming much more practical.
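The "min tokens" column follows from setting `compute_ms = transfer_ms` with the measured hardware numbers (160 TFLOPS, 11 GB/s pinned) and GLM-4.7's active fraction (12.5% of 4.1B params/layer). A sketch reproducing the table:

```python
GPU_FLOPS = 160e12            # measured matmul throughput
PCIE_BPS = 11e9               # measured pinned H2D bandwidth
P_ACTIVE = 0.125 * 4.1e9      # 8 routed experts + shared expert + attention

def min_tokens(layer_gb):
    """Tokens per step at which per-layer compute equals transfer time."""
    transfer_s = layer_gb * 1e9 / PCIE_BPS
    # compute_s = tokens * 6 * P_ACTIVE / GPU_FLOPS; solve for tokens
    return transfer_s * GPU_FLOPS / (6 * P_ACTIVE)

for name, gb in [("NF4", 2.25), ("NF3", 1.64), ("NF2", 1.15)]:
    print(name, round(min_tokens(gb)))   # ≈ 10.6K, 7.8K, 5.4K tokens
```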
## Implementation Notes

### Critical for correct overlap

1. **Pinned memory**: CPU buffers must use `pin_memory=True`. Pageable memory drops bandwidth from 11 GB/s to 7 GB/s and prevents true async DMA.

2. **Pre-allocated output buffers**: Use `torch.mm(A, B, out=C)` instead of `C = torch.mm(A, B)`. Temporary tensor allocations cause implicit CUDA synchronizations that serialize the pipeline. In testing, this single change reduced pipeline overhead from 78% to <0.5%.

3. **Dedicated copy stream**: Use a separate `torch.cuda.Stream()` for H2D transfers; the default stream serializes all operations.

4. **Stream synchronization**: After compute, call `torch.cuda.current_stream().wait_stream(copy_stream)` before the next iteration to ensure the incoming layer is ready.
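Put together, the four rules above look roughly like the following loop. This is a sketch, not the benchmark's actual code: the buffer and function names are illustrative, it assumes `cpu_layers` are pinned CPU tensors of identical shape, and it guards on `torch` and a CUDA device being available.

```python
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:                          # torch absent: definitions stay inert
    torch, HAVE_CUDA = None, False

def stream_layers(cpu_layers, run_layer, out):
    """Double-buffered H2D streaming: compute layer i while copying layer i+1.

    cpu_layers: pinned CPU tensors (rule 1: pin_memory=True)
    run_layer:  callable(weights, out) that writes into the pre-allocated
                GPU buffer `out` (rule 2: no fresh allocations in the loop)
    """
    copy_stream = torch.cuda.Stream()                        # rule 3
    slots = [torch.empty_like(cpu_layers[0], device="cuda") for _ in range(2)]
    done = [torch.cuda.Event(), torch.cuda.Event()]          # guard slot reuse
    with torch.cuda.stream(copy_stream):                     # prefetch layer 0
        slots[0].copy_(cpu_layers[0], non_blocking=True)
    for i in range(len(cpu_layers)):
        # Rule 4: make sure layer i's copy has landed before computing on it.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(cpu_layers):
            with torch.cuda.stream(copy_stream):
                if i >= 1:                                   # wait until the slot's
                    copy_stream.wait_event(done[(i + 1) % 2])  # old compute is done
                slots[(i + 1) % 2].copy_(cpu_layers[i + 1], non_blocking=True)
        run_layer(slots[i % 2], out)         # overlaps with the in-flight copy
        done[i % 2].record()                 # recorded on the compute stream
    torch.cuda.synchronize()
```

The events are what make slot reuse safe: the copy stream never overwrites a slot until the compute that last read it has finished.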
### What doesn't work

- **`torch.mm()` without `out=`** in the pipeline loop — causes CUDA allocator syncs, defeating the overlap.
- **Pageable (non-pinned) CPU memory** — the CUDA runtime copies through an internal staging buffer, halving bandwidth and preventing overlap.
- **A single CUDA stream** — serializes compute and transfer.
## Running the Benchmark

```bash
# Llama-70B layer size, 4K tokens (should show ~0% overhead)
python docs/streaming_analysis/stream_bench.py --layer-mb 470 --pipeline-tokens 4096

# GLM-4.7 layer size (all experts), 8K tokens
python docs/streaming_analysis/stream_bench.py --layer-mb 2250 --pipeline-tokens 8192

# With NVMe read test
python docs/streaming_analysis/stream_bench.py --layer-mb 470 --nvme /mnt/nvme

# Custom model dimensions
python docs/streaming_analysis/stream_bench.py \
    --layer-mb 470 --hidden 8192 --intermediate 28672 \
    --pipeline-tokens 4096 --n-layers 20
```
### Interpreting results

- **Test 1** (PCIe bandwidth): should show ~11 GB/s pinned on Gen3 and ~24 GB/s on Gen4; pageable should be noticeably slower.
- **Test 3** (matmul throughput): shows the actual TFLOPS on your GPU. Use this instead of the theoretical peak for planning.
- **Test 4** (overlap): single-shot overlap test. Should show >2× speedup when compute dominates transfer.
- **Test 5** (full pipeline): the definitive test. Compare "pipeline overhead vs compute-only" — it should be <5% when compute > transfer per layer.
