Commit 4ae1573
feat: Add streaming_quantize for HF→pre-quantized conversion
Two-pass streaming quantizer that converts a HuggingFace model checkpoint
to a pre-quantized safetensors file with minimal memory:
Pass 1: Parse the safetensors shard headers (no GPU, no tensor loads) to get
tensor shapes, compute quantized output sizes using validated formulas,
build the safetensors header with all tensor offsets, then write the header
and pre-allocate the output file.
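The header-only read in pass 1 relies on the safetensors layout: a little-endian u64 header length, followed by a JSON header mapping each tensor name to its dtype, shape, and data offsets. A minimal sketch of extracting shapes without touching tensor bytes (the `read_safetensors_shapes` helper and the in-memory shard are illustrative, not the committed code):

```python
import io
import json
import struct

def read_safetensors_shapes(f):
    """Return {tensor name: shape} from a safetensors stream.

    Only the header is parsed; the tensor data region is never read.
    """
    (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
    header = json.loads(f.read(header_len))
    return {name: tuple(meta["shape"])
            for name, meta in header.items()
            if name != "__metadata__"}

# Build a tiny fake shard in memory to demonstrate.
hdr = {"layer.weight": {"dtype": "F16", "shape": [2, 2],
                        "data_offsets": [0, 8]}}
hdr_json = json.dumps(hdr).encode()
shard = struct.pack("<Q", len(hdr_json)) + hdr_json + b"\x00" * 8

print(read_safetensors_shapes(io.BytesIO(shard)))  # → {'layer.weight': (2, 2)}
```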
Pass 2: For each layer, load the fp16 weights from the HF checkpoint,
quantize them on the GPU with quantize_kbit, and write the packed/absmax/codebook
data at the pre-computed file offsets. Only one layer is resident on the GPU
at a time.
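The offset-targeted writes in pass 2 can be sketched as below. This is a CPU stand-in, not the committed code: `absmax_quantize_int8` is a placeholder for the real `quantize_kbit` GPU kernel (whose signature and bit layout are not shown here), and the `offsets` dict stands in for the layout computed in pass 1.

```python
import os
import tempfile
import numpy as np

def absmax_quantize_int8(w):
    """Stand-in for quantize_kbit: per-tensor absmax int8 quantization."""
    absmax = np.abs(w).max() or 1.0
    packed = np.round(w / absmax * 127).astype(np.int8)
    return packed, np.float32(absmax)

# Hypothetical offsets pre-computed in pass 1.
offsets = {"layer.weight.packed": 0, "layer.weight.absmax": 4}

with tempfile.NamedTemporaryFile(delete=False) as out:
    out.truncate(8)  # pre-allocate the output region
    w = np.array([0.5, -1.0, 0.25, 1.0], dtype=np.float16)
    packed, absmax = absmax_quantize_int8(w.astype(np.float32))
    # Seek to the pre-computed offsets and write in place.
    out.seek(offsets["layer.weight.packed"]); out.write(packed.tobytes())
    out.seek(offsets["layer.weight.absmax"]); out.write(absmax.tobytes())
    path = out.name

assert os.path.getsize(path) == 8
os.unlink(path)
```

Because every destination offset is known up front, layers can be written in any order and the process never holds more than one layer's worth of weights.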
Supports dense (Llama, Mistral, Qwen) and MoE (Qwen3-MoE, GLM-4) models.
Handles sharded checkpoints via model.safetensors.index.json. Copies
config.json alongside the output for reproducibility.
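For sharded checkpoints, model.safetensors.index.json maps each tensor name to the shard file that stores it under a "weight_map" key. A sketch of grouping tensors by shard so each shard is opened once (the index payload below is hypothetical):

```python
from collections import defaultdict

# A hypothetical model.safetensors.index.json payload.
index = {
    "metadata": {"total_size": 123456},
    "weight_map": {
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
        "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    },
}

# Invert the map: shard file -> tensors it contains.
by_shard = defaultdict(list)
for name, shard in index["weight_map"].items():
    by_shard[shard].append(name)

for shard in sorted(by_shard):
    print(shard, sorted(by_shard[shard]))
```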
Tests verify bitwise identity with the in-memory save_quantized path, metadata
match, loadability via from_quantized, and config.json copying.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>