Commit dc521ef
feat: Add RAM strategy auto-detection (pinned/hybrid/mmap)
Automatically choose the best streaming backend based on available
system RAM. Three strategies:
- Pinned: pre-load all streamed layers to CPU pinned memory (fast)
- Hybrid: pin as many layers as fit, mmap the rest from safetensors
- Mmap: all layers loaded on demand from safetensors with staging buffers
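The selection between the three strategies could be sketched as below. This is an illustration only, not the code in this commit: the names `RamStrategy` and `choose_ram_strategy`, the `reserve_bytes` headroom parameter, and the greedy "pin layers in order until the budget is exhausted" rule are all assumptions made for the example.

```python
from enum import Enum
from typing import List, Tuple


class RamStrategy(Enum):
    PINNED = "pinned"
    HYBRID = "hybrid"
    MMAP = "mmap"


def choose_ram_strategy(
    layer_bytes: List[int],
    available_bytes: int,
    reserve_bytes: int = 0,
) -> Tuple[RamStrategy, int]:
    """Return (strategy, number of layers to pin).

    Hypothetical helper: pins layers in order while they fit inside
    available_bytes minus an assumed safety reserve.
    """
    budget = max(available_bytes - reserve_bytes, 0)
    used = 0
    pinned = 0
    for size in layer_bytes:
        if used + size > budget:
            break  # this layer and all later ones fall back to mmap
        used += size
        pinned += 1

    if pinned == len(layer_bytes):
        return RamStrategy.PINNED, pinned  # everything fits in pinned memory
    if pinned == 0:
        return RamStrategy.MMAP, 0  # nothing fits, load all layers on demand
    return RamStrategy.HYBRID, pinned  # pin a prefix, mmap the rest
```

With three 10-byte layers, a 25-byte budget pins the first two layers and leaves the third to the mmap path, which corresponds to the Hybrid case described above.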
Key changes:
- get_available_ram_bytes() reads MemAvailable from /proc/meminfo
- _init_weight_streaming() detects strategy and initializes accordingly
- _stream_load_layer() dispatches to pinned or mmap path
- _mmap_load_to_gpu() loads from safetensors → staging → GPU
- from_quantized() builds tensor name maps for mmap lookups
- Staging buffers are CPU pinned, sized for largest streamed layer
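For reference, reading MemAvailable from /proc/meminfo can look roughly like the sketch below. Only the name `get_available_ram_bytes` comes from this commit; the body, the `path` parameter, the helper `parse_mem_available_bytes`, and the choice to return `None` on failure are assumptions for illustration. The kernel reports the value in kB (1024-byte units), so it is multiplied by 1024.

```python
from typing import Optional


def parse_mem_available_bytes(meminfo_text: str) -> Optional[int]:
    """Extract MemAvailable from /proc/meminfo-style text, in bytes.

    Hypothetical helper; returns None if the field is missing.
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            fields = line.split()  # e.g. ["MemAvailable:", "2048", "kB"]
            return int(fields[1]) * 1024
    return None


def get_available_ram_bytes(path: str = "/proc/meminfo") -> Optional[int]:
    """Sketch: return available RAM in bytes, or None if it cannot be read.

    The path parameter and the None fallback are assumptions, e.g. for
    platforms without /proc/meminfo.
    """
    try:
        with open(path, "r", encoding="ascii") as f:
            return parse_mem_available_bytes(f.read())
    except OSError:
        return None
```

Keeping the parsing separate from the file read makes it possible to test the logic against a fixed string without depending on the host's actual memory state.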
6 new tests verify: default pinned, forced mmap, forced hybrid,
mmap forward/backward, hybrid forward/backward, and gradient
consistency between pinned and mmap paths.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 48c2eca
2 files changed, 425 insertions(+), 51 deletions(-)