Commit 48c2eca


feat: Add partial residency for weight streaming

Compute VRAM budget at init time and keep as many leading layers on GPU as possible. Non-resident layers are double-buffered from CPU pinned memory. Both forward_streaming and backward_streaming handle the resident/streamed boundary correctly.

Key changes:
- _compute_residency() estimates available VRAM after fixed costs
- _init_weight_streaming() only moves non-resident layers to CPU
- _layer_forward() checks _n_resident to decide data source
- forward_streaming/backward_streaming have two phases: resident (direct GPU access) and streamed (double-buffered)
- from_quantized() accepts batch_size/seq_len hints for VRAM estimate

5 new tests verify: full residency, forced partial, forward/backward correctness, zero-resident fallback, and gradient consistency between partial and full residency modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
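The residency decision and the two-phase loop described in this message can be illustrated with a minimal pure-Python sketch. It is not the code in kbit_lora.py: the names compute_n_resident and forward_two_phase, their parameters, and the choice to reserve two staging buffers sized for the largest layer are all assumptions made for illustration. Real streaming would also overlap the host-to-device copy of the next layer with compute, which this sketch only records as a buffer slot.

```python
from typing import Callable, List, Sequence, Tuple


def compute_n_resident(
    layer_bytes: Sequence[int],
    free_vram_bytes: int,
    fixed_cost_bytes: int,
) -> int:
    """Hypothetical stand-in for _compute_residency().

    Returns how many leading layers can stay on the GPU.
    """
    budget = free_vram_bytes - fixed_cost_bytes
    if sum(layer_bytes) <= budget:
        # Full residency: every layer fits, so no staging buffers are needed.
        return len(layer_bytes)
    # Partial residency (assumption): reserve two staging buffers, each large
    # enough for the biggest layer, before placing any resident layers.
    budget -= 2 * max(layer_bytes, default=0)
    n_resident = 0
    for size in layer_bytes:
        if size > budget:
            break
        budget -= size
        n_resident += 1
    return n_resident


def forward_two_phase(
    x: float,
    layers: Sequence[Callable[[float], float]],
    n_resident: int,
    log: List[Tuple[int, str]],
) -> float:
    """Run layers in two phases and record where each layer's weights come from."""
    # Phase 1: resident layers read their weights directly from GPU memory.
    for i in range(n_resident):
        log.append((i, "resident"))
        x = layers[i](x)
    # Phase 2: streamed layers alternate between two staging buffers.
    for i in range(n_resident, len(layers)):
        slot = (i - n_resident) % 2
        log.append((i, f"buffer{slot}"))
        x = layers[i](x)
    return x
```

With four layers of 100 bytes each, 400 bytes of free VRAM, and 50 bytes of fixed cost, this sketch keeps one layer resident and streams the other three. If the budget cannot cover the two staging buffers, it falls back to zero resident layers.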

1 parent fb5bcba, commit 48c2eca

2 files changed: +462 -66 lines changed

bitsandbytes
- kbit_lora.py
tests
- test_checkpoint.py

