Commit 5263e72
feat: Add weight streaming for CPU→GPU layer-by-layer weight transfer
Keep frozen quantized weights in CPU pinned memory and stream them to
GPU one layer at a time during training, using a double-buffered async
pipeline. While the GPU computes on one layer, the next layer's weights
transfer via PCIe DMA on a dedicated CUDA stream.
This reduces GPU memory for frozen base weights from O(n_layers) to
O(1) — only 2 layers' worth of quantized data on GPU at any time.
A 70B model's ~38 GB of NF4 weights shrinks to ~1 GB on GPU.
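The arithmetic behind those figures (the layer count and the quant-state overhead factor are rough assumptions for illustration, not values taken from the diff):

```python
n_params = 70e9                        # 70B parameters
nf4_gb = n_params * 0.5 / 1e9          # 4 bits per weight -> 35 GB
total_gb = nf4_gb * 1.08               # + ~8% absmax/quant state (rough) -> ~38 GB
n_layers = 80                          # typical depth for a 70B model (assumption)
resident_gb = 2 * total_gb / n_layers  # double buffer: 2 layers resident on GPU
```

With 80 layers, two resident layers come to just under 1 GB, matching the figure above.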
Implementation:
- KbitLoraModel: add weight_streaming parameter
- _init_weight_streaming: moves quantized weights to CPU pinned memory,
  pre-allocates 2 GPU buffer slots and a copy stream
- _forward_streaming: double-buffered pipeline (async prefetch next
  layer while computing current layer via checkpoint_cpu_offload)
- _layer_forward: detects forward (no_grad) vs backward (enable_grad)
  to use pre-loaded buffer or sync-load weights respectively
- train_qlora.py: add --weight-streaming flag (implies --cpu-offload)
Requires cpu_offload=True so that backward recomputes one layer at a
time; otherwise autograd would keep every layer's weights resident on
GPU for backward, defeating the streaming.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Parent: c7fd057
2 files changed, +393 −71 lines