Commit 48c2eca

TimDettmers and claude committed
feat: Add partial residency for weight streaming
Compute VRAM budget at init time and keep as many leading layers on GPU as possible. Non-resident layers are double-buffered from CPU pinned memory. Both forward_streaming and backward_streaming handle the resident/streamed boundary correctly.

Key changes:
- _compute_residency() estimates available VRAM after fixed costs
- _init_weight_streaming() only moves non-resident layers to CPU
- _layer_forward() checks _n_resident to decide data source
- forward_streaming/backward_streaming have two phases: resident (direct GPU access) and streamed (double-buffered)
- from_quantized() accepts batch_size/seq_len hints for VRAM estimate

5 new tests verify: full residency, forced partial, forward/backward correctness, zero-resident fallback, and gradient consistency between partial and full residency modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
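The residency decision described above can be illustrated with a minimal sketch. This is not the code from this commit: the function names `compute_residency` and `layer_sources`, their parameters, and the policy of reserving two staging buffers sized for the largest layer are all assumptions made for illustration. It only shows the general idea of subtracting fixed costs from free VRAM, keeping the leading layers that fit, and alternating two buffers for the rest.

```python
from typing import List, Sequence


def compute_residency(
    layer_bytes: Sequence[int],
    free_vram_bytes: int,
    fixed_cost_bytes: int,
    n_stream_buffers: int = 2,
) -> int:
    """Return how many leading layers can stay on the GPU (hypothetical helper).

    layer_bytes      -- size of each layer's weights, in execution order
    free_vram_bytes  -- VRAM available before anything is placed
    fixed_cost_bytes -- assumed estimate for activations, optimizer state, etc.
    """
    budget = free_vram_bytes - fixed_cost_bytes

    # Full residency: everything fits, so no streaming buffers are needed.
    if sum(layer_bytes) <= budget:
        return len(layer_bytes)

    # Partial residency: reserve staging buffers for double buffering.
    # Sizing them for the largest layer is a simplifying assumption.
    budget -= n_stream_buffers * max(layer_bytes, default=0)

    n_resident = 0
    used = 0
    for size in layer_bytes:
        if used + size > budget:
            break  # residency is a contiguous prefix of leading layers
        used += size
        n_resident += 1
    return n_resident


def layer_sources(n_layers: int, n_resident: int) -> List[str]:
    """Label where each layer's weights come from during a forward pass.

    Resident layers are read directly from GPU memory; streamed layers
    alternate between two staging buffers so that the copy of layer i+1
    can overlap with the compute of layer i.
    """
    return [
        "gpu" if i < n_resident else f"buffer{(i - n_resident) % 2}"
        for i in range(n_layers)
    ]
```

With eight layers of 100 bytes each, 700 bytes of free VRAM, and 100 bytes of fixed cost, this sketch keeps four layers resident and streams the remaining four. A backward pass would walk the same labels in reverse order, streamed phase first.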
1 parent fb5bcba commit 48c2eca

2 files changed: +462 -66 lines changed

0 commit comments