Commit 9e36d62
Add Gemma 4 31B-IT model, export, and quantization framework for ExecuTorch (pytorch#19213)
Text-only export of Gemma 4 31B-IT to ExecuTorch with INT4/INT8 weight
quantization. Quantized weights use torchao's native tensor subclasses
(Int4Tensor, IntxUnpackedToInt8Tensor) for serialization, aligning with
the torchao ecosystem.
The quant/ package separates quantization into independent modules:
- recipe.py: declarative QuantRecipe with regex FQN matching and
per-layer overrides
- quantize.py: quantize_weight / dequantize_weight / quantize_model —
all return torchao subclasses directly. 8-bit fully delegates to
IntxUnpackedToInt8Tensor.from_hp (min_max and HQQ); 4-bit uses torchao
primitives plus manual Int4Tensor construction (pending mslk availability
for from_hp)
- pack.py: pack_model (bulk, groups by parent module for MoE) and pack_one
(streaming); dispatches via isinstance(_, TorchAOBaseTensor) — see the
sketch after this list
- pack_cuda.py: converts Int4Tensor to IntxUnpackedToInt8Tensor (int4
values unpacked to int8) and passes INT8 IntxUnpackedToInt8Tensor
through unchanged. No CUDA required for packing — the CUDA-specific
tinygemm conversion is a source transform applied at export time
- gguf.py: unpacks Q4_K/Q6_K GGUF blocks directly to
Int4Tensor/IntxUnpackedToInt8Tensor, with a streaming iterator
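A minimal sketch of the streaming pack path mentioned in the pack.py item
above: pack_one's name and the isinstance check on TorchAOBaseTensor come
from this commit, while the per-subclass packing bodies here are
placeholders, not the real packers.
```python
# Sketch of pack.py's streaming dispatch. Only pack_one and the
# isinstance(_, TorchAOBaseTensor) check are described by this commit;
# the branch bodies are placeholders.
import torch
from torchao.utils import TorchAOBaseTensor

def pack_one(name: str, weight: torch.Tensor) -> torch.Tensor:
    """Pack a single weight for the target backend, one tensor at a time."""
    if not isinstance(weight, TorchAOBaseTensor):
        # Plain tensors (norms, biases) pass through untouched.
        return weight
    # Real code branches per subclass here (Int4Tensor vs.
    # IntxUnpackedToInt8Tensor); this placeholder returns the tensor as-is.
    return weight
```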
Serialization uses torchao's safetensors integration
(torchao.prototype.safetensors) — no custom format. Checkpoints are
compatible with torchao's save_pretrained/load_pretrained and can be
loaded by vLLM.
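A hedged save-side sketch: the commit points at torchao.prototype.safetensors,
but the flatten helper name below is an assumption, not confirmed API.
```python
# Hedged sketch of checkpoint saving. safetensors.torch.save_file is real;
# flatten_tensor_state_dict is an assumption about what
# torchao.prototype.safetensors provides.
import torch
from safetensors.torch import save_file
from torchao.prototype.safetensors import flatten_tensor_state_dict  # assumed

def save_quantized(model: torch.nn.Module, path: str) -> None:
    # Subclass tensors are flattened into plain tensors + metadata so the
    # file stays a standard safetensors checkpoint (loadable by vLLM).
    flat, metadata = flatten_tensor_state_dict(model.state_dict())
    save_file(flat, path, metadata=metadata)
```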
This framework is designed to be promoted and reused for Qwen 3.5 MoE
and other models — adding a new model requires only a QuantRecipe and
optionally a custom packer.
Quantization recipes: "default" (INT4 min_max linears + INT8 per-axis
embedding) and "sensitive" (INT8 for edge-layer v_proj/down_proj, INT4
HQQ asymmetric elsewhere).
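A sketch of what the "sensitive" recipe declaration could look like.
QuantRecipe, regex FQN matching, and per-layer overrides are from this
commit; the field names and the FQN pattern below are assumptions (which
layers count as "edge" is not spelled out here).
```python
# Sketch of a declarative recipe; field names and patterns are assumptions.
from dataclasses import dataclass, field

@dataclass
class QuantSpec:
    bits: int
    method: str = "min_max"   # "min_max" or "hqq"
    symmetric: bool = True

@dataclass
class QuantRecipe:
    default: QuantSpec
    overrides: dict = field(default_factory=dict)  # regex FQN -> QuantSpec

SENSITIVE = QuantRecipe(
    default=QuantSpec(bits=4, method="hqq", symmetric=False),
    overrides={
        # Illustrative pattern: keep edge-layer v_proj/down_proj at INT8.
        r"layers\.0\.(self_attn\.v_proj|mlp\.down_proj)": QuantSpec(bits=8),
    },
)
```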
Dual-path INT4 linear dispatch: IntxUnpackedToInt8Tensor's F.linear
dispatch dequantizes to bf16 and calls cuBLAS, optimal for prefill (12x
faster than tinygemm at T=2048). For decode, a model-agnostic source
transform (backends/cuda/transforms/int4_linear_dispatch.py) converts to
Int4TilePackedTo4dTensor (tinygemm), optimal for M=1. Export flow: export
prefill first (dequant+cuBLAS), apply the tinygemm transform, then export
decode. inference.py applies the same tinygemm transform for fast eager
decode.
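A sketch of that two-stage export ordering under stated assumptions:
torch.export.export is real, while apply_int4_tinygemm_transform is a
stand-in for whatever backends/cuda/transforms/int4_linear_dispatch.py
actually exposes.
```python
# Sketch of the two-stage export ordering described above.
import torch
from torch.export import export

def apply_int4_tinygemm_transform(model: torch.nn.Module) -> None:
    """Stand-in: swap INT4 linear weights to Int4TilePackedTo4dTensor."""
    pass  # the real transform rewrites weights in place

def export_prefill_then_decode(model, prefill_args: tuple, decode_args: tuple):
    # 1) Prefill graph captures IntxUnpackedToInt8Tensor's F.linear path
    #    (dequant to bf16 + cuBLAS), the fast choice at large T.
    prefill_ep = export(model, prefill_args)
    # 2) Convert INT4 linears to the tinygemm layout in place.
    apply_int4_tinygemm_transform(model)
    # 3) Decode graph captures the tinygemm dispatch, the fast choice at M=1.
    decode_ep = export(model, decode_args)
    return prefill_ep, decode_ep
```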
Split-K flash decoding: ReplaceEdgeOpWithTritonOpPass in the CUDA
backend selects triton::sdpa_decode_splitk for SDPA nodes where L_q = 1
and L_kv > 2048. At 128K context, full-attention decode SDPA improves
from 15.7ms/layer to 0.7ms/layer (22x). Sliding-window layers (ring
buffer <= 2048) use the standard triton::sdpa. No model code changes —
the pass inspects Q/K shapes in the exported graph automatically.
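A sketch of the selection predicate the pass could apply once it has read
Q/K shapes off the graph; the L_q = 1 condition and 2048 threshold are from
this commit, the node-matching machinery is simplified away.
```python
# Sketch of the kernel-selection rule in ReplaceEdgeOpWithTritonOpPass.
SPLITK_KV_THRESHOLD = 2048

def pick_sdpa_op(q_len: int, kv_len: int) -> str:
    # Decode-shaped SDPA over a long context -> split-K flash decoding.
    if q_len == 1 and kv_len > SPLITK_KV_THRESHOLD:
        return "triton::sdpa_decode_splitk"
    # Prefill, short contexts, and sliding-window layers (ring buffer
    # <= 2048) keep the standard kernel.
    return "triton::sdpa"
```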
GGUF support: inference.py --gguf and export.py --gguf load
community-quantized GGUF files directly. The tied embed/lm_head is
untied — the embedding is dequantized to bf16 for gather, while lm_head
keeps INT4 for the matmul.
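A sketch of that untying step, using the dequantize_weight helper named in
the quantize.py item above; the module attribute paths (model.lm_head,
model.embed_tokens) are illustrative.
```python
# Sketch of untying embed/lm_head for GGUF checkpoints.
import torch
from quant.quantize import dequantize_weight  # helper from this commit

def untie_embedding(model: torch.nn.Module) -> None:
    quantized = model.lm_head.weight  # stays INT4 for the output matmul
    # Gather (embedding lookup) runs on a plain bf16 copy instead.
    dense = dequantize_weight(quantized).to(torch.bfloat16)
    model.embed_tokens.weight = torch.nn.Parameter(dense, requires_grad=False)
```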
Ring-buffer KV cache: Sliding-window layers use RingKVCache (2x the
window size) instead of flat max_seq_len buffers. The C++ runner chunks
long prompts automatically via get_max_prefill_chunk metadata. Chunked
prefill produces logits identical to sequential prefill (verified by
test).
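A minimal sketch of the ring indexing behind RingKVCache; the 2x-window
capacity is from this commit, while the shapes and update signature are
assumptions.
```python
# Sketch: a sliding-window layer allocates 2 * window slots and wraps
# writes, instead of a flat max_seq_len buffer.
import torch

class RingKVCacheSketch:
    def __init__(self, window: int, n_heads: int, head_dim: int):
        self.capacity = 2 * window
        self.k = torch.zeros(n_heads, self.capacity, head_dim)
        self.v = torch.zeros(n_heads, self.capacity, head_dim)

    def update(self, pos: int, k_new: torch.Tensor, v_new: torch.Tensor) -> int:
        slot = pos % self.capacity  # wrap the absolute position into the ring
        self.k[:, slot] = k_new
        self.v[:, slot] = v_new
        return slot
```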
Includes: C++ runner with BOS/EOS handling, chunked prefill, and #ifdef
guards for non-CUDA builds; eager inference with torch.compile; unit and
integration tests across quant/tests/, tests/, and backends/cuda/tests/.
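An eager-Python sketch of the chunked prefill the C++ runner performs; the
get_max_prefill_chunk metadata key is named above, while the forward
signature (tokens, start_pos) is an assumption.
```python
# Sketch of chunked prefill: feed a long prompt in fixed-size chunks.
import torch

def chunked_prefill(model, tokens: torch.Tensor, max_chunk: int) -> torch.Tensor:
    """Final logits match single-shot prefill (the property the test above
    verifies), since each chunk resumes at its absolute position."""
    logits = None
    for start in range(0, tokens.numel(), max_chunk):
        chunk = tokens[start : start + max_chunk].unsqueeze(0)
        logits = model(chunk, start_pos=start)
    return logits
```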
```
┌──────────────────┬───────────────────┐
│ Metric           │ Value             │
├──────────────────┼───────────────────┤
│ Prompt tokens    │ 513               │
│ Generated tokens │ 128               │
│ Prefill          │ 766 tok/s (670ms) │
│ Decode           │ 21.5 tok/s        │
│ TTFT             │ 89ms              │
│ GPU peak         │ 25.1GB            │
│ Model load       │ 28.8s             │
└──────────────────┴───────────────────┘
```
---------
Co-authored-by: mnachin <mnachin@fb.com>