# CPU→GPU Weight Streaming for QLoRA Training

This document analyzes the feasibility of streaming frozen base weights from
CPU DRAM (or NVMe) to the GPU during QLoRA training, eliminating the need to
hold the full model in VRAM. The analysis covers a theoretical bandwidth model,
empirical validation on an RTX 4090, and configuration grids for a range of
hardware setups.

## Key Result

**Weight streaming works with near-zero overhead** once per-layer compute time
exceeds the PCIe transfer time. On an RTX 4090 with PCIe 3.0 x16, the
crossover is approximately 4K tokens per step. Above this, the double-buffered
pipeline hides 100% of the transfer latency, with <0.5% measured overhead.

## Architecture

QLoRA freezes the base model weights and trains only the low-rank adapters. The
frozen weights are read-only during both the forward and backward passes, making
them ideal candidates for streaming from slower storage tiers.

### Three-tier pipeline

```
NVMe SSD ──(3.5-14 GB/s)──> CPU DRAM ──(11-50 GB/s PCIe)──> GPU VRAM
cold storage                bandwidth buffer                compute
```

The GPU maintains a **double buffer** (2 layer slots). While computing on one
layer, the next layer transfers asynchronously from CPU DRAM via PCIe DMA on a
dedicated CUDA stream. The CPU DRAM buffer decouples the NVMe and GPU rates.

### Double-buffer operation

```
Time ──────────────────────────────────────────────────────>
GPU:       [compute L0] [compute L1] [compute L2] [compute L3]
PCIe:      [xfer L1   ] [xfer L2   ] [xfer L3   ] [xfer L4   ]
NVMe→CPU:  [read L2   ] [read L3   ] [read L4   ]
```

Each layer occupies one slot. After the GPU finishes computing a layer, that
slot is freed for the next incoming transfer. No GPU idle time occurs as long
as `compute_time >= transfer_time`.
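
The timeline above can be sanity-checked with a tiny analytic sketch (pure
Python, illustrative only): in steady state the GPU idles per layer only for
the part of the transfer that compute does not cover.

```python
def pipeline_idle_ms(n_layers, compute_ms, transfer_ms):
    """Total GPU idle time in a double-buffered stream: layer i+1
    transfers while layer i computes, so the GPU waits only when the
    transfer is still in flight after compute finishes."""
    per_layer_idle = max(0.0, transfer_ms - compute_ms)
    return (n_layers - 1) * per_layer_idle

# Numbers from the benchmark tables below (43 ms transfer/layer):
assert pipeline_idle_ms(80, 50, 43) == 0.0    # compute-bound: fully hidden
assert pipeline_idle_ms(80, 24.5, 43) > 0.0   # PCIe-bound at 2K tokens
```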

## Theoretical Model

### Per-layer timing

For a transformer layer with `P_active` active parameters:

```
compute_ms = tokens × 3 × 2 × P_active / GPU_FLOPS × 1000
                      ↑   ↑
                      │   └── 2 FLOPs per multiply-accumulate
                      └────── 3× for training (forward + backward ≈ 3× forward)

transfer_ms = layer_size_bytes / pcie_bandwidth × 1000
```

For MoE models, `P_active` is the active subset (routed experts + attention +
shared expert), while `layer_size_bytes` includes **all** experts, since the
routing decision is token-dependent.
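
The two formulas translate directly into a small sketch (plain Python; the
470 MB layer size and the measured 11 GB/s PCIe rate come from the tables
below):

```python
def compute_ms(tokens, p_active, gpu_flops):
    # 3x for training (forward + backward), 2 FLOPs per multiply-accumulate
    return tokens * 3 * 2 * p_active / gpu_flops * 1000

def transfer_ms(layer_bytes, pcie_bytes_per_s):
    return layer_bytes / pcie_bytes_per_s * 1000

# Llama-70B layer over measured PCIe Gen3: 470 MB / 11 GB/s
print(round(transfer_ms(470e6, 11e9)))  # → 43 (ms, matching the table below)
```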

### GPU ring buffer sizing

```
K_ring = max(2, ceil(transfer_ms / compute_ms) + 1)
```

The ring buffer holds `K_ring` layers. The GPU processes `K_ring - 1` layers
while one layer transfers. When `compute_ms > transfer_ms`, `K_ring = 2`
(a plain double buffer) suffices.
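
As a sketch, the sizing rule in code (illustrative helper, not repo API):

```python
import math

def k_ring(transfer_ms, compute_ms):
    # One slot receives while the other K-1 slots are computed on.
    return max(2, math.ceil(transfer_ms / compute_ms) + 1)

assert k_ring(43, 50) == 2   # compute-bound: a double buffer suffices
assert k_ring(86, 50) == 3   # transfer spans ~2 compute slots
```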

### NVMe CPU buffer sizing

The CPU DRAM buffer must absorb the rate mismatch between NVMe reads and GPU
consumption. Over the total processing time, the NVMe delivers:

```
nvme_delivered    = n_layers × layer_cycle_ms / nvme_ms
cpu_buffer_layers = n_layers - K_ring - nvme_delivered
```

where `layer_cycle_ms = max(compute_ms, transfer_ms)`.
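
Plugging in illustrative numbers (Llama-70B figures from the tables below,
with an assumed single 7 GB/s NVMe) shows how many layers the DRAM buffer must
stage:

```python
def cpu_buffer_layers(n_layers, compute_ms, transfer_ms, nvme_ms, k_ring=2):
    layer_cycle_ms = max(compute_ms, transfer_ms)
    nvme_delivered = n_layers * layer_cycle_ms / nvme_ms
    return n_layers - k_ring - nvme_delivered

# 80 layers, 50 ms compute, 43 ms PCIe, 470 MB / 7 GB/s ≈ 67 ms NVMe read
buf = cpu_buffer_layers(80, 50, 43, 470e6 / 7e9 * 1000)
print(round(buf))  # → 18 layers, roughly 8-9 GB of DRAM at 470 MB/layer
```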

### Pipeline throughput

```
step_time = n_layers × max(compute_ms, transfer_ms, nvme_ms)
```

The slowest leg (compute, PCIe, or NVMe) determines throughput. Buffering
shifts *when* the bottleneck is felt, but does not change the steady-state
rate.
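
With the measured Llama-70B numbers (50 ms compute, 43 ms PCIe) and an assumed
single 7 GB/s NVMe, the NVMe leg dominates; doubling NVMe bandwidth makes the
step compute-bound:

```python
def step_time_s(n_layers, compute_ms, transfer_ms, nvme_ms):
    return n_layers * max(compute_ms, transfer_ms, nvme_ms) / 1000

nvme_ms = 470e6 / 7e9 * 1000                          # ≈ 67 ms/layer
print(round(step_time_s(80, 50, 43, nvme_ms), 1))     # → 5.4 (NVMe-bound)
print(round(step_time_s(80, 50, 43, nvme_ms / 2), 1)) # → 4.0 (compute-bound)
```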

## Measured Hardware Parameters (RTX 4090 + PCIe 3.0)

| Parameter | Theoretical | Measured |
|---|---|---|
| PCIe Gen3 x16 H2D (pinned) | 13 GB/s | **11 GB/s** (85% eff.) |
| PCIe Gen3 x16 H2D (pageable) | — | **7 GB/s** |
| GPU FP16 tensor throughput | 330 TFLOPS | **160 TFLOPS** |
| Transfer/layer (470 MB, Llama-70B) | 36 ms | **43 ms** |
| Compute/layer @ 4K tokens | 61 ms | **50 ms** |
| Compute/layer @ 8K tokens | 122 ms | **100 ms** |
| Pipeline overhead (compute > transfer) | 0% | **<0.5%** |

The GPU achieves ~160 TFLOPS on these matmul shapes (not the 330 TFLOPS peak,
which requires ideal tile sizes). PCIe runs at 85% of theoretical due to
protocol overhead. Both deviations are consistent and predictable.

## Benchmark Results

### Pipeline overhead vs. batch size (Llama-70B, 470 MB/layer)

| Tokens | Compute/layer | Transfer/layer | Overhead | Verdict |
|--------|---------------|----------------|----------|---------|
| 512 | 6.4 ms | 42 ms | +536% | PCIe-limited |
| 1024 | 12.6 ms | 43 ms | +224% | PCIe-limited |
| 2048 | 24.5 ms | 43 ms | +67% | PCIe-limited |
| **4096** | **50 ms** | **43 ms** | **+0.2%** | **Fully hidden** |
| 8192 | 100 ms | 42 ms | +0.4% | Fully hidden |

The crossover is sharp: below ~4K tokens the GPU idles waiting for PCIe;
above it, transfers are completely hidden behind compute.

### Memory savings

For the pipeline test with 20 × 470 MB layers (9.2 GB of weights in total):

- GPU ring buffer (2 layers): **0.92 GB**
- GPU peak memory: **3.55 GB** (ring + activations + compute buffers)
- VRAM savings: **90%**

Extrapolated to full Llama-70B (80 layers, 38 GB total):

- GPU ring buffer: **0.94 GB** (2 layers)
- Remaining 78 layers: in CPU DRAM or NVMe
- VRAM savings: **97.5%**
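
The extrapolated savings figure follows from the ring size alone (sketch,
using the 470 MB/layer and 38 GB NF4 totals from above):

```python
ring_gb = 2 * 470e6 / 1e9        # two 470 MB slots resident in VRAM
total_gb = 38                    # full Llama-70B at NF4
savings = 1 - ring_gb / total_gb
print(f"{savings:.1%}")          # → 97.5%
```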

## Configuration Grids

### Llama-70B (dense, 80 layers, 856M params/layer, 38 GB NF4)

Minimum feasible batch size per hardware configuration:

| NVMe config | Gen3 x16 PCIe | Gen4 x16 | Gen5 x16 |
|---|---|---|---|
| 1× Gen3 (3.5 GB/s) | 1K (11s/step) | 4K (11s) | 4K (11s) |
| 2× Gen3 (7 GB/s) | 1K (5s) | 1K (5s) | 2K (5s) |
| 1× Gen4 (7 GB/s) | 1K (5s) | 1K (5s) | 2K (5s) |
| 2× Gen5 (28 GB/s) | n/a | n/a | 1K (1s) |

Dense models are straightforward: 100% of transferred weights contribute to
compute, so even a slow NVMe works at small batch sizes.

### GLM-4.7 (355B MoE, 92 layers, 4.1B params/layer, 207 GB NF4)

The MoE architecture creates a poor weight-to-compute ratio: each layer
transfers 2.25 GB (all 160 experts), but only 12.5% of the parameters (8
active experts + attention + the shared expert) contribute FLOPs. This makes
the model significantly harder to stream.

**With 32 GB system RAM:**

| NVMe config | Gen4 x16 | Gen5 x16 |
|---|---|---|
| 1× Gen4 (7 GB/s) | 32K (30s) | 32K (30s) |
| 2× Gen4 (14 GB/s) | 16K (15s) | 16K (15s) |
| 2× Gen5 (28 GB/s) | n/a | **8K (7s)** |
| 4× Gen4 (28 GB/s) | n/a | **8K (7s)** |

**With 128 GB system RAM** (the larger CPU buffer absorbs the NVMe rate
mismatch):

| NVMe config | Gen4 x16 | Gen5 x16 |
|---|---|---|
| 2× Gen4 (14 GB/s) | 2K (15s) | 8K (15s) |
| 2× Gen5 (28 GB/s) | n/a | **1K (7s)** |
| 4× Gen4 (28 GB/s) | n/a | **1K (7s)** |

### Effect of lower-bit quantization on GLM-4.7

Lower-bit quantization reduces transfer size without changing compute (weights
are dequantized to FP16 before the matmul):

| Quantization | Layer size | Model size | Transfer/layer | Min tokens (0% overhead) |
|---|---|---|---|---|
| NF4 (4-bit) | 2.25 GB | 207 GB | 205 ms | ~11K |
| NF3 (3-bit) | 1.64 GB | 151 GB | 149 ms | ~8K |
| NF2 (2-bit) | 1.15 GB | 106 GB | 104 ms | ~5K |

At NF2, the 355B MoE model's per-layer transfer time approaches that of a 70B
dense model at NF4, making streaming far more practical.
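
The min-token column can be reproduced from the crossover condition
`compute_ms = transfer_ms`, using the measured 11 GB/s PCIe rate and
~160 TFLOPS from the hardware table, with GLM-4.7's active subset (12.5% of
4.1B ≈ 512M params); a sketch:

```python
def min_tokens(layer_bytes, active_params, pcie_bps=11e9, gpu_flops=160e12):
    transfer_s = layer_bytes / pcie_bps
    # Crossover: tokens * 3 * 2 * active_params / gpu_flops == transfer_s
    return transfer_s * gpu_flops / (6 * active_params)

active = 0.125 * 4.1e9
for name, gb in [("NF4", 2.25), ("NF3", 1.64), ("NF2", 1.15)]:
    print(name, round(min_tokens(gb * 1e9, active) / 1000), "K tokens")
# → NF4 11, NF3 8, NF2 5 (matches the table)
```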

## Implementation Notes

### Critical for correct overlap

1. **Pinned memory**: CPU buffers must use `pin_memory=True`. Pageable memory
   drops bandwidth from 11 GB/s to 7 GB/s and prevents true async DMA.

2. **Pre-allocated output buffers**: Use `torch.mm(A, B, out=C)` instead of
   `C = torch.mm(A, B)`. Temporary tensor allocations cause implicit CUDA
   synchronizations that serialize the pipeline. In testing, this single change
   reduced pipeline overhead from 78% to <0.5%.

3. **Dedicated copy stream**: Use a separate `torch.cuda.Stream()` for H2D
   transfers. The default stream serializes all operations.

4. **Stream synchronization**: After compute, call
   `torch.cuda.current_stream().wait_stream(copy_stream)` before the next
   iteration to ensure the incoming layer is ready.
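
The four notes combine into the following loop sketch. It assumes PyTorch with
a CUDA device; `stream_layers` and the pinned `cpu_layers` list it takes are
illustrative names, not an API from this repo, and the matmul stands in for
the real per-layer compute.

```python
def stream_layers(cpu_layers, x):
    """cpu_layers: list of pinned CPU weight tensors (pin_memory=True).
    x: activations already on the GPU. Streams each layer through a
    2-slot ring and multiplies into a pre-allocated output buffer."""
    import torch  # deferred so the sketch can be read without torch installed

    copy_stream = torch.cuda.Stream()                      # note 3
    slots = [torch.empty_like(cpu_layers[0], device="cuda") for _ in range(2)]
    out = torch.empty(x.shape[0], cpu_layers[0].shape[1], device="cuda")

    with torch.cuda.stream(copy_stream):                   # prefetch layer 0
        slots[0].copy_(cpu_layers[0], non_blocking=True)   # note 1: pinned src

    for i in range(len(cpu_layers)):
        # Note 4: make sure layer i has landed before computing on it.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(cpu_layers):
            # The slot being overwritten was last read two iterations ago;
            # wait for that compute before reusing it.
            copy_stream.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(copy_stream):
                slots[(i + 1) % 2].copy_(cpu_layers[i + 1], non_blocking=True)
        torch.mm(x, slots[i % 2], out=out)                 # note 2: out= buffer
    torch.cuda.synchronize()
    return out
```

Only the structure matters here: one prefetch ahead, a wait on each side of
the slot handoff, and no allocations inside the loop.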

### What doesn't work

- **`torch.mm()` without `out=`** in the pipeline loop: causes CUDA allocator
  syncs, defeating the overlap.
- **Pageable (non-pinned) CPU memory**: the CUDA runtime copies through an
  internal staging buffer, halving bandwidth and preventing overlap.
- **A single CUDA stream**: serializes compute and transfer.

## Running the Benchmark

```bash
# Llama-70B layer size, 4K tokens (should show ~0% overhead)
python docs/streaming_analysis/stream_bench.py --layer-mb 470 --pipeline-tokens 4096

# GLM-4.7 layer size (all experts), 8K tokens
python docs/streaming_analysis/stream_bench.py --layer-mb 2250 --pipeline-tokens 8192

# With NVMe read test
python docs/streaming_analysis/stream_bench.py --layer-mb 470 --nvme /mnt/nvme

# Custom model dimensions
python docs/streaming_analysis/stream_bench.py \
    --layer-mb 470 --hidden 8192 --intermediate 28672 \
    --pipeline-tokens 4096 --n-layers 20
```

### Interpreting results

- **Test 1** (PCIe bandwidth): should show ~11 GB/s pinned on Gen3 and
  ~24 GB/s on Gen4. Pageable should be noticeably slower.
- **Test 3** (matmul throughput): shows the actual TFLOPS on your GPU. Use
  this instead of the theoretical peak for planning.
- **Test 4** (overlap): single-shot overlap test. Should show >2x speedup when
  compute dominates transfer.
- **Test 5** (full pipeline): the definitive test. Compare "pipeline overhead
  vs compute-only"; it should be <5% when compute > transfer per layer.