Commit 4ae1573
feat: Add streaming_quantize for HF→pre-quantized conversion
Two-pass streaming quantizer that converts a HuggingFace model checkpoint
to a pre-quantized safetensors file with minimal memory:
Pass 1: Parse the safetensors shard headers (no GPU, no tensor loads) to get
tensor shapes, compute quantized output sizes using validated formulas,
build the safetensors header with all tensor offsets, then write the header
and pre-allocate the output file.
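The header-only read in pass 1 relies on the safetensors layout: a little-endian u64 header length, followed by a JSON header mapping each tensor name to its dtype, shape, and data offsets. A minimal sketch of extracting shapes without touching tensor bytes (the `read_safetensors_shapes` helper and the in-memory shard are illustrative, not the committed code):

```python
import io
import json
import struct

def read_safetensors_shapes(f):
    """Return {tensor name: shape} from a safetensors stream.

    Only the header is parsed; the tensor data region is never read.
    """
    (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
    header = json.loads(f.read(header_len))
    return {name: tuple(meta["shape"])
            for name, meta in header.items()
            if name != "__metadata__"}

# Build a tiny fake shard in memory to demonstrate.
hdr = {"layer.weight": {"dtype": "F16", "shape": [2, 2],
                        "data_offsets": [0, 8]}}
hdr_json = json.dumps(hdr).encode()
shard = struct.pack("<Q", len(hdr_json)) + hdr_json + b"\x00" * 8

print(read_safetensors_shapes(io.BytesIO(shard)))  # → {'layer.weight': (2, 2)}
```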
Pass 2: For each layer, load the fp16 weights from the HF checkpoint,
quantize them on the GPU with quantize_kbit, and write the packed/absmax/codebook
data at the pre-computed file offsets. Only one layer is resident on the GPU
at a time.
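The offset-targeted writes in pass 2 can be sketched as below. This is a CPU stand-in, not the committed code: `absmax_quantize_int8` is a placeholder for the real `quantize_kbit` GPU kernel (whose signature and bit layout are not shown here), and the `offsets` dict stands in for the layout computed in pass 1.

```python
import os
import tempfile
import numpy as np

def absmax_quantize_int8(w):
    """Stand-in for quantize_kbit: per-tensor absmax int8 quantization."""
    absmax = np.abs(w).max() or 1.0
    packed = np.round(w / absmax * 127).astype(np.int8)
    return packed, np.float32(absmax)

# Hypothetical offsets pre-computed in pass 1.
offsets = {"layer.weight.packed": 0, "layer.weight.absmax": 4}

with tempfile.NamedTemporaryFile(delete=False) as out:
    out.truncate(8)  # pre-allocate the output region
    w = np.array([0.5, -1.0, 0.25, 1.0], dtype=np.float16)
    packed, absmax = absmax_quantize_int8(w.astype(np.float32))
    # Seek to the pre-computed offsets and write in place.
    out.seek(offsets["layer.weight.packed"]); out.write(packed.tobytes())
    out.seek(offsets["layer.weight.absmax"]); out.write(absmax.tobytes())
    path = out.name

assert os.path.getsize(path) == 8
os.unlink(path)
```

Because every destination offset is known up front, layers can be written in any order and the process never holds more than one layer's worth of weights.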
Supports dense (Llama, Mistral, Qwen) and MoE (Qwen3-MoE, GLM-4) models.
Handles sharded checkpoints via model.safetensors.index.json. Copies
config.json alongside the output for reproducibility.
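For sharded checkpoints, model.safetensors.index.json maps each tensor name to the shard file that stores it under a "weight_map" key. A sketch of grouping tensors by shard so each shard is opened once (the index payload below is hypothetical):

```python
from collections import defaultdict

# A hypothetical model.safetensors.index.json payload.
index = {
    "metadata": {"total_size": 123456},
    "weight_map": {
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
        "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    },
}

# Invert the map: shard file -> tensors it contains.
by_shard = defaultdict(list)
for name, shard in index["weight_map"].items():
    by_shard[shard].append(name)

for shard in sorted(by_shard):
    print(shard, sorted(by_shard[shard]))
```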
Tests verify bitwise identity with the in-memory save_quantized path, metadata
match, loadability via from_quantized, and config.json copying.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>