# quant/

- Packing-agnostic quantization framework: **recipe → quantize → serialize → pack**.
+ Quantization framework: **recipe → quantize → pack**.

## Files

| File | Concern | Depends on |
| --- | --- | --- |
| `recipe.py` | **Policy** — what to quantize, what precision, which layers | nothing |
- | `quantize.py` | **Computation** — produces/dequantizes canonical weights | recipe, torchao |
- | `serialize.py` | **Data format** — saves/loads canonical weights to safetensors | recipe |
- | `pack.py` | **Packing dispatch** — `pack_model` (bulk) and `pack_one` (streaming) | serialize |
- | `pack_cuda.py` | **CUDA packing** — converts canonical to tinygemm/intx runtime format | pack, serialize |
- | `gguf.py` | **GGUF import** — unpacks Q4_K/Q6_K blocks to canonical form | recipe, serialize |
+ | `quantize.py` | **Computation** — produces torchao subclass tensors | recipe, torchao |
+ | `pack.py` | **Packing dispatch** — `pack_model` (bulk) and `pack_one` (streaming) | — |
+ | `pack_cuda.py` | **CUDA packing** — converts Int4Tensor to tinygemm format | pack |
+ | `gguf.py` | **GGUF import** — unpacks Q4_K/Q6_K blocks to torchao subclasses | torchao |

## Data flow

```
- QuantRecipe → quantize_model() → CanonicalQuantizedWeight → save() → file → load() → CanonicalQuantizedWeight → pack_model() → runtime model
+ QuantRecipe → quantize_model() → state_dict{Int4Tensor, IntxUnpackedToInt8Tensor, Tensor} → safetensors → state_dict → pack_model() → runtime model
```

- `CanonicalQuantizedWeight` is the interchange point — int8 qdata + bf16
- scale + optional zero + config. Everything left of it is backend-agnostic.
- Everything right is backend-specific.
+ Quantized weights are stored as torchao tensor subclasses:
+ - **Int4Tensor** — 4-bit weights (nibble-packed qdata + transposed scale/zero_point)
+ - **IntxUnpackedToInt8Tensor** — 8-bit weights (int8 qdata + scale + zero_point)
+
+ These are the canonical interchange formats from torchao. Everything left
+ of `save()` is backend-agnostic. Everything right is backend-specific.

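To make "nibble-packed" concrete: two 4-bit values share one byte. The sketch below is illustrative only. It uses plain Python rather than tensors, the helper names `pack_nibbles` and `unpack_nibbles` are hypothetical, and the low-nibble-first order is an assumption; torchao's `Int4Tensor` may lay out its `qdata` differently.

```python
def pack_nibbles(values):
    """Pack unsigned 4-bit ints (0..15) two per byte, low nibble first.

    The nibble order is an assumption made for this sketch.
    """
    if len(values) % 2:
        raise ValueError("need an even number of 4-bit values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("every value must fit in 4 bits")
    # Even-indexed values go to the low nibble, odd-indexed to the high nibble.
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: split each byte back into two 4-bit ints."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out
```

Packing halves the storage for 4-bit weights, and `unpack_nibbles(pack_nibbles(v))` returns `v` for any valid even-length input.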
## Adding a new backend

@@ -32,56 +34,21 @@ def pack_linear_for_metal(module, weights): ...
3234DEFAULT_METAL_PACKERS = {nn.Linear: pack_linear_for_metal}
3335```

- Call `pack_model(model, quantized, unquantized, packers=DEFAULT_METAL_PACKERS)`.
- No changes to recipe, quantize, or serialize.
-
- Things to consider:
-
- - **Recipes may need to be backend-aware.** Each backend's kernels have
-   different constraints (e.g., Metal's `fpa4w` is INT4-only — no INT8 linear
-   kernel, so the sensitive recipe's 8-bit edge layers would need to be INT4
-   or dequantized to bf16). Define per-backend recipes or validate recipe
-   compatibility at pack time.
- - **Source transforms before packing.** Some backends replace model modules
-   (e.g., MLX swaps `FusedMoEExperts` → `SwitchMLP`, Metal swaps to
-   `MetalMoEExperts`). These transforms change the module types that
-   packers dispatch on, so they must run before `pack_model()`. For dense
-   models (no MoE) this is not needed.
- - **Embedding quantization.** Not all backends have a quantized embedding
-   gather kernel. The packer can dequantize to bf16 at load time — the
-   disk savings from the canonical format still apply.
-
- ## Adding a new model
-
- 1. Define a `QuantRecipe` with rules for the model's FQN patterns.
- 2. If the model has custom module types (e.g., `FusedMoEExperts`), write a
-    per-module packer and extend the packers dict:
-    ```python
-    packers = {**DEFAULT_CUDA_PACKERS, FusedMoEExperts: pack_moe_experts}
-    ```
- 3. No changes to the quant package itself.
+ Call `pack_model(model, state_dict, packers=DEFAULT_METAL_PACKERS)`.
+ No changes to recipe or quantize.

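The dispatch behind a packers dict can be sketched as a lookup on module type. This is a minimal stand-in, not the code in `pack.py`: `pack_model_sketch`, the `Linear` and `Embedding` placeholder classes, and the `{fqn}.weight` key convention are assumptions made for illustration. Only the `pack_linear_for_metal(module, weights)` signature and the `DEFAULT_METAL_PACKERS` name come from this README.

```python
class Linear:
    """Placeholder for nn.Linear so the sketch runs without torch."""


class Embedding:
    """Placeholder module type with no registered packer."""


def pack_linear_for_metal(module, weights):
    # A real packer would convert the weights to the backend's runtime layout.
    # Here we only record what was received.
    module.packed = ("metal", weights)


DEFAULT_METAL_PACKERS = {Linear: pack_linear_for_metal}


def pack_model_sketch(modules, state_dict, packers):
    """Call the packer registered for each module's exact type.

    modules: dict mapping fully qualified name -> module (hypothetical shape).
    state_dict: dict mapping "{fqn}.weight" -> weight (hypothetical keys).
    Returns the names of the modules that were packed.
    """
    packed = []
    for fqn, module in modules.items():
        packer = packers.get(type(module))
        if packer is None:
            continue  # no packer for this type: leave the module untouched
        packer(module, {"weight": state_dict[f"{fqn}.weight"]})
        packed.append(fqn)
    return packed
```

Because the lookup is keyed on type, supporting another backend in this sketch means passing a different dict; the loop itself does not change.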
## On-disk format

- Safetensors with a `format_version` in the header. Per quantized weight:
- `{fqn}.qdata` (int8, nibble-packed for 4-bit), `{fqn}.scale` (bf16),
- optionally `{fqn}.zero` (bf16). Header JSON records bits, group_size,
- symmetric, and method per weight. Unquantized weights stored as-is.
+ Uses torchao's safetensors integration (`torchao.prototype.safetensors`).
+ Each tensor subclass is decomposed into its inner tensors
+ (e.g., `layer._weight_qdata`, `layer._weight_scale`) plus JSON metadata
+ recording the subclass type and attributes. Plain tensors are stored as-is.
+ The format is compatible with torchao's `save_pretrained` / `load_pretrained`.

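The decomposition idea (inner tensors as flat entries, subclass type and attributes as JSON metadata) can be illustrated without torchao. The sketch below does not call `torchao.prototype.safetensors` and does not reproduce its key names or metadata schema: `FakeQuantWeight`, `flatten`, `unflatten`, and the `{name}.{field}` keys are all hypothetical, and plain lists stand in for tensors.

```python
import json
from dataclasses import dataclass

INNER_FIELDS = ("qdata", "scale", "zero_point")


@dataclass
class FakeQuantWeight:
    """Hypothetical stand-in for a quantized tensor subclass."""
    qdata: list
    scale: list
    zero_point: list
    block_size: int  # example of a non-tensor attribute kept in metadata


def flatten(state_dict):
    """Split quantized entries into inner tensors plus JSON metadata."""
    tensors, metadata = {}, {}
    for name, value in state_dict.items():
        if isinstance(value, FakeQuantWeight):
            for field in INNER_FIELDS:
                tensors[f"{name}.{field}"] = getattr(value, field)
            metadata[name] = {
                "type": type(value).__name__,
                "block_size": value.block_size,
            }
        else:
            tensors[name] = value  # plain tensors are stored as-is
    return tensors, json.dumps(metadata)


def unflatten(tensors, metadata_json):
    """Rebuild the original state dict from flatten()'s output."""
    metadata = json.loads(metadata_json)
    state_dict, consumed = {}, set()
    for name, info in metadata.items():
        keys = {field: f"{name}.{field}" for field in INNER_FIELDS}
        inner = {field: tensors[key] for field, key in keys.items()}
        state_dict[name] = FakeQuantWeight(block_size=info["block_size"], **inner)
        consumed.update(keys.values())
    for key, value in tensors.items():
        if key not in consumed:
            state_dict[key] = value
    return state_dict
```

A round trip through `flatten` and `unflatten` returns an equal state dict, which is the property a loader needs before handing weights to `pack_model()`.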
## TODO

- - `pack_metal.py` — Metal backend packer. Convert canonical INT4 to
-   `UIntxWeightOnlyConfig` subclass (torchao experimental) for the
-   `torchao::_linear_fp_act_4bit_weight` kernel. For MoE models, pack
-   expert weights into Metal's `gather_qmv` format (asymmetric, unsigned
-   INT4 with scale + bias buffers).
-
- - `pack_mlx.py` — MLX backend packer. Convert canonical INT4 to
-   `IntxWeightOnlyConfig` subclass for the `mlx::gather_qmm` kernel.
-   For MoE models, stack per-expert weights into `SwitchLinear` format.
-
- - `gguf.py` — extend with Q5_K, Q8_0, and other GGUF quant types.
-   Currently supports Q4_K and Q6_K. Some Q4_K_M files also contain
-   Q5_K or Q8_0 tensors (for sensitive layers on certain architectures)
-   which will raise — add support as needed. Q6_K is widened to 8-bit
-   for CUDA packing since there is no 6-bit CUDA kernel.
+ - `pack_metal.py` — Metal backend packer.
+ - `pack_mlx.py` — MLX backend packer.
+ - `gguf.py` — extend with Q5_K, Q8_0 GGUF quant types.
+ - Upstream `Int4TilePackedTo4dTensor.from_int4_tensor()` to torchao
+   to replace the manual conversion in `pack_int4_for_cuda`.