Add Gemma 4 31B-IT model, export, and quantization framework for ExecuTorch #19213
mergennachin wants to merge 2 commits into `main`
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19213
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV — There is 1 currently active SEV. If your PR is affected, please view it below.
⏳ 1 Pending, 2 Unrelated Failures — As of commit 9108a5b with merge base d8da621:
- FLAKY: The following job failed but was likely due to flakiness present on trunk.
- BROKEN TRUNK: The following job failed but was already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
Adds a new Gemma 4 31B-IT example pipeline for ExecuTorch (CUDA backend), including a packing-agnostic quantization format + recipes, CUDA packers, export/inference scripts, a C++ runner, and CI coverage.
Changes:
- Introduces `examples/models/gemma4_31b/quant/` with a recipe → quantize → serialize → pack flow, plus unit tests.
- Adds the Gemma 4 31B model implementation with hybrid attention and a sliding-window KV cache, plus export and eager inference entrypoints.
- Adds CUDA runner build targets and runs Gemma 4 31B tests in the CUDA GitHub Actions workflow.
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| examples/models/gemma4_31b/test_pipeline.py | CPU-only integration tests for quantize/save/load roundtrip and tiny checkpoint fixtures. |
| examples/models/gemma4_31b/test_cuda_pipeline.py | CUDA integration tests for pack/infer/export on a tiny model. |
| examples/models/gemma4_31b/sampler.py | GPU-side Gumbel-max sampler used by the exported model. |
| examples/models/gemma4_31b/quantize_and_save.py | CLI to quantize HF checkpoints and write packing-agnostic safetensors bundles + production recipes. |
| examples/models/gemma4_31b/quant/test_serialize.py | Unit tests for nibble packing and safetensors serialization format. |
| examples/models/gemma4_31b/quant/test_recipe.py | Unit tests for regex/layer-filter recipe matching + production recipe regression tests. |
| examples/models/gemma4_31b/quant/test_quantize.py | Unit tests for quantize_weight and quantize_model APIs (CPU + CUDA/HQQ paths). |
| examples/models/gemma4_31b/quant/test_pack_cuda.py | CUDA unit tests for int4/int8 packers and load-and-pack dispatcher behavior. |
| examples/models/gemma4_31b/quant/serialize.py | Canonical quantized weight format + safetensors save/load with versioned metadata. |
| examples/models/gemma4_31b/quant/recipe.py | Declarative quantization recipe/rule objects with regex FQN matching and optional layer filters. |
| examples/models/gemma4_31b/quant/quantize.py | Implements min-max and HQQ quantization into canonical (packing-free) representations. |
| examples/models/gemma4_31b/quant/pack_cuda.py | CUDA-specific packers converting canonical weights into torchao runtime tensor subclasses. |
| examples/models/gemma4_31b/quant/pack.py | Backend-agnostic pack dispatcher that assigns weights/buffers and calls module-type packers. |
| examples/models/gemma4_31b/quant/__init__.py | Public API re-exports for the quant/ package. |
| examples/models/gemma4_31b/quant/README.md | Documentation of the quant framework, data flow, and backend extension points. |
| examples/models/gemma4_31b/model.py | Gemma 4 31B model definition, HF checkpoint loader, ring KV cache for sliding layers, runtime buffer materialization. |
| examples/models/gemma4_31b/model.md | Architecture/design notes for model + quant pipeline. |
| examples/models/gemma4_31b/main.cpp | ExecuTorch CUDA runner driving exported prefill/decode and HF tokenizer decoding. |
| examples/models/gemma4_31b/inference.py | Eager CUDA inference script loading prequantized weights, packing, and generating text. |
| examples/models/gemma4_31b/export.py | Export + lowering pipeline (decode + prefill methods) targeting the CUDA backend. |
| examples/models/gemma4_31b/__init__.py | Package marker for the new model example. |
| examples/models/gemma4_31b/README.md | User-facing instructions for quantize/export/inference/build/run workflows. |
| examples/models/gemma4_31b/CMakePresets.json | CMake preset for building the Gemma 4 31B CUDA runner. |
| examples/models/gemma4_31b/CMakeLists.txt | CMake build for the Gemma 4 31B runner, linking ExecuTorch + CUDA backend + tokenizer. |
| examples/models/gemma4/text_decoder/gemma4_norm.py | Replaces the transformers RMSNorm dependency with a self-contained implementation (a sketch follows this table). |
| examples/models/gemma4/text_decoder/__init__.py | Exposes attention/norm/MLP primitives used by gemma4_31b for shared numerically sensitive ops. |
| Makefile | Adds gemma4_31b-cuda build target. |
| .github/workflows/cuda.yml | Adds Gemma 4 31B quant + pipeline tests to the CUDA CI job. |
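On the gemma4_norm.py row above: a minimal self-contained Gemma-style RMSNorm, assuming the usual Gemma convention of normalizing in float32 and scaling by `1 + weight`, might look roughly like the following (a sketch, not the actual contents of gemma4_norm.py):

```python
import torch
from torch import nn


class Gemma4RMSNorm(nn.Module):
    """Minimal RMSNorm sketch; the real gemma4_norm.py may differ."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Gemma-family norms initialize the scale to zero and apply (1 + weight).
        self.weight = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize in float32 for numerical stability, then cast back.
        dtype = x.dtype
        x = x.float()
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return (x * (1.0 + self.weight.float())).to(dtype)
```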
- Sliding window layers use RingKVCache (2× window size) instead of a flat max_seq_len buffer, reducing KV cache memory for long sequences.
- Prefill is capped to the ring buffer size; the C++ runner chunks longer prompts automatically via get_max_prefill_chunk metadata (sketched below).
- Both recipes now quantize embed_tokens to INT8 per-axis (~1.4 GB savings vs bf16). The embedding packer uses IntxUnpackedToInt8Tensor, which supports gather.
- pack_model handles top-level FQNs (no parent module).
- C++ runner aligned with Qwen patterns: #ifdef guards for non-CUDA builds, better weight_sharing error handling, cudaDeviceSynchronize between prefill and decode.
- Test suite split into test_pipeline.py (CPU) and test_cuda_pipeline.py (CUDA) with shared fixtures. New chunked prefill correctness test.
- Prequantized checkpoint available at huggingface.co/SocialLocalMobile/gemma-4-31B-it-HQQ-INT4.
- Added Gemma 4 31B tests to the cuda.yml CI workflow.
- Cleaned up stale terminology, docstrings, and comments throughout.
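For illustration, the chunked prefill described above could look roughly like this on the driver side; `model.prefill` and the `max_chunk` value (read from the exported get_max_prefill_chunk metadata) are assumed interfaces for this sketch, not the actual C++ runner code:

```python
def prefill_in_chunks(model, tokens: list[int], max_chunk: int):
    """Feed a long prompt through prefill in ring-buffer-sized chunks.

    Sketch only: `model.prefill` and `max_chunk` (from the exported
    get_max_prefill_chunk metadata) are assumed interfaces.
    """
    pos = 0
    logits = None
    while pos < len(tokens):
        chunk = tokens[pos : pos + max_chunk]
        # Each chunk advances the KV cache; start_pos tells the model
        # where this chunk begins in the overall sequence.
        logits = model.prefill(chunk, start_pos=pos)
        pos += len(chunk)
    return logits  # logits for the last prompt token seed decoding
```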
Text-only export of Gemma 4 31B-IT to ExecuTorch with the CUDA backend
and INT4/INT8 weight quantization via a new packing-agnostic quant/
framework.
The quant/ package separates quantization into four concerns:
- recipe.py: declarative QuantRecipe with regex FQN matching
- quantize.py: produces CanonicalQuantizedWeight (min_max, HQQ)
- serialize.py: save/load to safetensors with versioned headers
- pack.py + pack_cuda.py: per-module packer dispatch for CUDA
This framework will be promoted and reused for Qwen3_5_moe and other desktop/laptop models.
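To make the intended data flow concrete, here is a rough sketch; the function names and signatures are assumptions based on the module descriptions above, not the final API (see quant/README.md for the real one):

```python
# Sketch of the recipe -> quantize -> serialize -> pack flow.
# Function and argument names below are assumptions.
from examples.models.gemma4_31b.quant import (
    QuantRecipe, quantize_model, save_quantized, load_and_pack,
)


def quantize_and_bundle(model, out_path: str = "gemma4_31b_int4.safetensors"):
    recipe = QuantRecipe.named("default")  # INT4 min_max + INT8 embedding
    # 1. Quantize into the canonical, packing-free representation:
    #    a mapping of FQN -> CanonicalQuantizedWeight.
    canonical = quantize_model(model, recipe)
    # 2. Serialize with versioned headers; no backend layout is baked in.
    save_quantized(canonical, out_path)


def prepare_for_cuda(model, bundle_path: str):
    # 3. On the target machine: load the canonical weights and dispatch
    #    per-module packers for the CUDA backend.
    load_and_pack(model, bundle_path, backend="cuda")
```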
Two production recipes: "default" (INT4 min_max + INT8 embedding) and
"sensitive" (INT8 for edge-layer v_proj/down_proj, INT4 HQQ elsewhere).
Sliding window attention uses a ring-buffer KV cache (2x window size)
for the 50 sliding layers, saving memory for long sequences. The 10
full-attention layers use a standard flat KV cache.
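A minimal picture of the ring-buffer update for the sliding layers (a sketch only; the real RingKVCache also handles masking, batching, and export-friendly indexing):

```python
import torch


class RingKVCacheSketch:
    """Sketch of a ring KV cache sized at 2x the sliding window."""

    def __init__(self, window: int, n_heads: int, head_dim: int,
                 dtype=torch.bfloat16):
        # 2x the window, matching the description above, so the most
        # recent window of entries is always resident in the buffer.
        self.size = 2 * window
        self.k = torch.zeros(1, n_heads, self.size, head_dim, dtype=dtype)
        self.v = torch.zeros(1, n_heads, self.size, head_dim, dtype=dtype)

    def update(self, pos: torch.Tensor, k_new: torch.Tensor,
               v_new: torch.Tensor):
        # Absolute positions wrap modulo the buffer size; entries that
        # fall outside the sliding window are overwritten naturally.
        slots = pos % self.size
        self.k[:, :, slots] = k_new
        self.v[:, :, slots] = v_new
        return self.k, self.v
```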
Includes C++ runner (main.cpp), eager inference script, and 60+ unit
and integration tests across quant/ and pipeline test files.
Uses this model: https://huggingface.co/SocialLocalMobile/gemma-4-31B-it-HQQ-INT4