Commit 4ae1573

TimDettmers and claude committed
feat: Add streaming_quantize for HF→pre-quantized conversion
Two-pass streaming quantizer that converts a HuggingFace model checkpoint to a pre-quantized safetensors file with minimal memory:

Pass 1: Parse safetensors shard headers (no GPU, no tensor loads) to get tensor shapes, compute quantized output sizes using validated formulas, build the safetensors header with all tensor offsets, write the header, and pre-allocate the output file.

Pass 2: For each layer, load fp16 weights from the HF checkpoint, quantize on GPU using quantize_kbit, and write packed/absmax/codebook at the pre-computed file offsets. Only one layer is on the GPU at a time.

Supports dense (Llama, Mistral, Qwen) and MoE (Qwen3-MoE, GLM-4) models. Handles sharded checkpoints via model.safetensors.index.json. Copies config.json alongside the output for reproducibility.

Tests verify bitwise identity with in-memory save_quantized, metadata match, loadability via from_quantized, and config.json copying.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
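The two-pass scheme can be sketched as follows. This is an illustrative stand-in, not the commit's actual code: it uses a simple per-block absmax int8 quantizer in place of quantize_kbit, and a raw binary file with a Python-side offset table in place of the real safetensors header. All function names and the `block` parameter are assumptions for the sketch.

```python
import numpy as np

def plan_offsets(shapes, block=64):
    """Pass 1: from tensor shapes alone (no tensor data loaded), compute
    byte offsets for each quantized artifact: packed int8 weights plus a
    float32 absmax per block. Returns the offset plan and total file size."""
    plan, off = {}, 0
    for name, shape in shapes.items():
        n = int(np.prod(shape))
        packed = n                                 # int8: one byte per element
        absmax = ((n + block - 1) // block) * 4    # one float32 per block
        plan[name] = {"packed": (off, packed), "absmax": (off + packed, absmax)}
        off += packed + absmax
    return plan, off

def quantize_blockwise(w, block=64):
    """Stand-in quantizer: per-block absmax scaling to int8 in [-127, 127]."""
    flat = w.astype(np.float32).ravel()
    pad = (-len(flat)) % block
    blocks = np.concatenate([flat, np.zeros(pad, np.float32)]).reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1)
    absmax[absmax == 0] = 1.0                      # avoid divide-by-zero
    q = np.round(blocks / absmax[:, None] * 127).astype(np.int8)
    return q.ravel()[: len(flat)], absmax.astype(np.float32)

def streaming_quantize(weights, path, block=64):
    """Pass 1 plans offsets from shapes; Pass 2 visits one tensor at a time,
    quantizes it, and writes at the pre-computed offsets. Only one tensor's
    data is materialized at once (the streaming-memory property)."""
    plan, total = plan_offsets({k: v.shape for k, v in weights.items()}, block)
    with open(path, "wb") as f:
        f.truncate(total)                          # pre-allocate the output file
        for name, w in weights.items():
            q, absmax = quantize_blockwise(w, block)
            off, _ = plan[name]["packed"]
            f.seek(off); f.write(q.tobytes())
            off, _ = plan[name]["absmax"]
            f.seek(off); f.write(absmax.tobytes())
    return plan
```

The key property the real implementation shares with this sketch is that output offsets are a pure function of tensor shapes, so the file can be laid out before any weights are read; Pass 2 then becomes independent positional writes.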
1 parent 34d4dc3 commit 4ae1573

File tree: 2 files changed (+619 −1 lines changed)


0 commit comments
