
Commit 5263e72

TimDettmers and claude committed
feat: Add weight streaming for CPU→GPU layer-by-layer weight transfer
Keep frozen quantized weights in CPU pinned memory and stream them to GPU one layer at a time during training, using a double-buffered async pipeline. While the GPU computes on one layer, the next layer's weights transfer via PCIe DMA on a dedicated CUDA stream. This reduces GPU memory for frozen base weights from O(n_layers) to O(1) — only 2 layers' worth of quantized data on GPU at any time. A 70B model's ~38 GB of NF4 weights shrinks to ~1 GB on GPU. Implementation: - KbitLoraModel: add weight_streaming parameter - _init_weight_streaming: moves quantized weights to CPU pinned memory, pre-allocates 2 GPU buffer slots and a copy stream - _forward_streaming: double-buffered pipeline (async prefetch next layer while computing current layer via checkpoint_cpu_offload) - _layer_forward: detects forward (no_grad) vs backward (enable_grad) to use pre-loaded buffer or sync-load weights respectively - train_qlora.py: add --weight-streaming flag (implies --cpu-offload) Requires cpu_offload=True so backward recomputes one layer at a time (otherwise autograd saves all layers' weights on GPU for backward). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent c7fd057 commit 5263e72
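The double-buffered prefetch loop described in the commit message can be sketched as follows. This is a minimal, device-agnostic illustration, not the repository's `_forward_streaming`: the function name `stream_forward`, the plain matmul standing in for a layer, and the per-call `pin_memory()` are all simplifications for clarity (a real implementation would pin the weights once at init, as `_init_weight_streaming` does).

```python
import torch

def stream_forward(cpu_weights, x, device="cpu"):
    """Double-buffered weight streaming (sketch): keep per-layer weights on
    the CPU and hold only two layers' weights on the compute device at once.
    On CUDA, the next layer's H2D copy overlaps the current layer's compute."""
    dev = torch.device(device)
    use_cuda = dev.type == "cuda"
    copy_stream = torch.cuda.Stream() if use_cuda else None
    bufs = [None, None]  # two device-side buffer slots, used ping-pong

    def prefetch(i):
        w = cpu_weights[i]
        if use_cuda:
            # The copy runs on a dedicated stream; first wait for any compute
            # that may still be reading the slot we are about to overwrite.
            copy_stream.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(copy_stream):
                # Pinned host memory enables true async DMA transfers.
                bufs[i % 2] = w.pin_memory().to(dev, non_blocking=True)
        else:
            bufs[i % 2] = w.to(dev)  # synchronous fallback without CUDA

    prefetch(0)
    for i in range(len(cpu_weights)):
        if use_cuda:
            # Compute must not start until this layer's copy has landed.
            torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(cpu_weights):
            prefetch(i + 1)  # overlap next transfer with this layer's math
        x = x @ bufs[i % 2]  # stand-in for the real quantized layer forward
    return x
```

The ping-pong indexing (`i % 2`) is what keeps the device footprint at exactly two layers: while layer `i` computes from one slot, layer `i + 1` streams into the other.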

File tree

2 files changed: +393 −71 lines


0 commit comments
