Commit 48c2eca
feat: Add partial residency for weight streaming
Compute the VRAM budget at init time and keep as many leading
layers on the GPU as fit within it. Non-resident layers are
streamed from pinned CPU memory through a double buffer. Both
forward_streaming and backward_streaming handle the boundary
between resident and streamed layers.
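A minimal sketch of how a residency count could be derived from a
VRAM budget. This is an illustration, not the code in this commit:
the function name, its parameters, and the policy of reserving two
staging buffers sized to the largest layer are all assumptions.

```python
def compute_n_resident(layer_bytes, free_vram_bytes, fixed_cost_bytes,
                       n_stream_buffers=2):
    """Return how many leading layers fit on the GPU (hypothetical helper).

    layer_bytes:      per-layer weight sizes, in execution order
    free_vram_bytes:  VRAM currently available
    fixed_cost_bytes: estimate for activations, optimizer state, etc.
    """
    if not layer_bytes:
        return 0

    budget = free_vram_bytes - fixed_cost_bytes

    # Full residency: every layer fits, so no staging buffers are needed.
    if sum(layer_bytes) <= budget:
        return len(layer_bytes)

    # Partial residency (assumed policy): reserve the double buffer first,
    # sized conservatively to the largest layer.
    budget -= n_stream_buffers * max(layer_bytes)

    n_resident = 0
    used = 0
    for size in layer_bytes:
        if used + size > budget:
            break
        used += size
        n_resident += 1

    # A budget too small for even one layer yields 0: everything is streamed.
    return n_resident
```

With eight layers of 100 bytes each, 700 bytes free, and a fixed
cost of 100 bytes, this sketch keeps 4 layers resident.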
Key changes:
- _compute_residency() estimates available VRAM after fixed costs
- _init_weight_streaming() only moves non-resident layers to CPU
- _layer_forward() checks _n_resident to decide data source
- forward_streaming/backward_streaming have two phases: resident
  (direct GPU access) and streamed (double-buffered)
- from_quantized() accepts batch_size/seq_len hints for VRAM estimate
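The two-phase control flow listed above can be sketched in plain
Python. This is a simplified model under stated assumptions, not the
commit's implementation: forward_two_phase and fetch are hypothetical
names, layers are plain callables, and the fetch call stands in for
what would be an asynchronous copy from pinned CPU memory in real code.

```python
def forward_two_phase(x, layers, n_resident, fetch):
    """Run layers in order, treating the first n_resident as GPU-resident.

    fetch(layer) models copying a streamed layer into a GPU staging slot
    and returns the usable layer (hypothetical callback).
    """
    # Phase 1: resident layers are used directly, with no copies.
    for layer in layers[:n_resident]:
        x = layer(x)

    # Phase 2: streamed layers alternate between two staging slots.
    streamed = layers[n_resident:]
    if not streamed:
        return x

    slots = [None, None]
    slots[0] = fetch(streamed[0])  # prime the first slot
    for i in range(len(streamed)):
        # Prefetch the next layer into the other slot before computing.
        # Real code would issue this on a side stream to overlap with compute.
        if i + 1 < len(streamed):
            slots[(i + 1) % 2] = fetch(streamed[i + 1])
        x = slots[i % 2](x)
    return x
```

Setting n_resident to 0 streams every layer, and setting it to
len(layers) skips the streamed phase entirely. The output is the
same in both cases because only the data source changes.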
Five new tests cover full residency, forced partial residency,
forward/backward correctness, the zero-resident fallback, and
gradient consistency between partial and full residency modes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent fb5bcba commit 48c2eca
2 files changed, +462 -66 lines changed