Commit 2f04ac7
docs: Add GLM-4.7 355B streaming simulation and hardware analysis
Complete simulation of QLoRA fine-tuning with weight streaming for
GLM-4.7 355B MoE, calibrated against RTX 4090 matmul benchmarks.
- streaming_sim.py: Full simulation with memory budgets, compute/transfer
  overlap, optimal resident/batch sweep across 5 GPUs, 6 quant formats
  (NF4, NF3, NF2, NF4d+NF2e, NF4d+NF3e, NVFP4), 7 storage configs,
  and pipeline parallelism modeling
- bench_matmul.py: BF16 and NF4 dequant+matmul benchmarks for GPU
  utilization calibration (measured 81-97% on RTX 4090)
- GLM47_ANALYSIS.md: Complete analysis document covering the resident/batch
  trade-off, NVFP4 on Blackwell (2.74x effective speedup), AM5 x8/x8
  validation, and 4 hardware build recommendations ($2.7K-$7.6K)
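The utilization figure reported for bench_matmul.py can be read as achieved matmul throughput divided by peak throughput. The following is a minimal sketch of that arithmetic, not the code from bench_matmul.py: the function names, the 1.5 ms timing, and the 100 TFLOP/s peak are hypothetical values chosen for illustration.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def utilization(m: int, n: int, k: int, elapsed_s: float, peak_flops: float) -> float:
    """Achieved throughput as a fraction of an assumed peak throughput."""
    achieved_flops_per_s = matmul_flops(m, n, k) / elapsed_s
    return achieved_flops_per_s / peak_flops


if __name__ == "__main__":
    # Hypothetical inputs: a 4096^3 matmul timed at 1.5 ms on a device
    # with an assumed peak of 100 TFLOP/s. Neither number is a measurement.
    u = utilization(4096, 4096, 4096, elapsed_s=1.5e-3, peak_flops=100e12)
    print(f"utilization: {u:.1%}")
```

In a real benchmark the elapsed time would come from a synchronized GPU timer averaged over many iterations, and the peak would be the vendor-specified figure for the data type in use.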
Key finding: the optimal resident/batch split achieves 0% streaming
overhead across all tested configurations. The simulation assumes 70% GPU
utilization, which is conservative relative to the measured 81-97%.
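One way to read the 0% overhead result: if weight transfer fully overlaps with compute, a step takes the longer of the two, so overhead vanishes once the batch is large enough for compute to cover the transfer. The sketch below illustrates that idea only. It assumes perfect overlap, compute time linear in batch size, and transfer time independent of batch size; the function names and example numbers are hypothetical and are not taken from streaming_sim.py.

```python
import math


def streaming_overhead(compute_s: float, transfer_s: float) -> float:
    """Extra step time, as a fraction of compute time, under perfect overlap."""
    step_s = max(compute_s, transfer_s)  # overlapped step is bound by the slower side
    return (step_s - compute_s) / compute_s


def min_batch_for_zero_overhead(
    streamed_bytes: float, link_bytes_per_s: float, compute_s_per_sample: float
) -> int:
    """Smallest batch whose compute time covers the per-step weight transfer."""
    transfer_s = streamed_bytes / link_bytes_per_s
    return math.ceil(transfer_s / compute_s_per_sample)


if __name__ == "__main__":
    # Hypothetical numbers: 32 GB streamed per step over a 16 GB/s link,
    # with 0.25 s of compute per sample.
    batch = min_batch_for_zero_overhead(32e9, 16e9, 0.25)
    compute_s = batch * 0.25
    print(batch, streaming_overhead(compute_s, 32e9 / 16e9))
```

Keeping more weights resident shrinks `streamed_bytes` but leaves less memory for the batch, which is the trade-off the sweep explores.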
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 5263e72 commit 2f04ac7
3 files changed (+2001, -0) in docs/streaming_analysis