
Add Gemma 4 31B-IT model, export, and quantization framework for ExecuTorch #19213

Open

mergennachin wants to merge 2 commits into main from gemma4-31b-quant-framework

Conversation

Contributor

mergennachin commented Apr 29, 2026

Text-only export of Gemma 4 31B-IT to ExecuTorch with the CUDA backend
and INT4/INT8 weight quantization via a new packing-agnostic quant/
framework.

The quant/ package separates quantization into four concerns (see the sketch after this list):

  • recipe.py: declarative QuantRecipe with regex FQN matching
  • quantize.py: produces CanonicalQuantizedWeight (min_max, HQQ)
  • serialize.py: save/load to safetensors with versioned headers
  • pack.py + pack_cuda.py: per-module packer dispatch for CUDA
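
For illustration, the recipe → quantize flow might look like the sketch below. QuantRule, match_rule, and quantize_min_max are hypothetical names standing in for the recipe.py/quantize.py APIs; only QuantRecipe, CanonicalQuantizedWeight, and the min_max/HQQ algorithm names come from this PR.

```python
# Sketch of the recipe -> quantize flow. QuantRule, match_rule, and
# quantize_min_max are hypothetical names, not the actual quant/ API.
import re
from dataclasses import dataclass

import torch

@dataclass
class QuantRule:
    pattern: str                # regex matched against the weight's FQN
    bits: int                   # 4 or 8
    group_size: int             # weights sharing one scale/zero-point
    algorithm: str = "min_max"  # or "hqq"

def match_rule(rules: list[QuantRule], fqn: str) -> QuantRule | None:
    # First matching rule wins; unmatched weights stay unquantized.
    return next((r for r in rules if re.search(r.pattern, fqn)), None)

def quantize_min_max(w: torch.Tensor, bits: int, group_size: int):
    """Asymmetric min/max quantization into a packing-free ("canonical")
    form: integer values plus per-group scale and zero-point."""
    qmax = 2**bits - 1
    grouped = w.float().reshape(-1, group_size)
    lo = grouped.min(dim=1, keepdim=True).values
    hi = grouped.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / qmax
    zero_point = torch.round(-lo / scale)
    q = torch.clamp(torch.round(grouped / scale) + zero_point, 0, qmax)
    # A CanonicalQuantizedWeight would bundle these three tensors.
    return q.to(torch.uint8), scale, zero_point

rules = [
    QuantRule(r"embed_tokens\.weight$", bits=8, group_size=2048),
    QuantRule(r"layers\.\d+\..*proj\.weight$", bits=4, group_size=128),
]
rule = match_rule(rules, "model.layers.3.self_attn.q_proj.weight")
q, scale, zp = quantize_min_max(torch.randn(256, 256), rule.bits, rule.group_size)
```

The point of the canonical form is that pack.py/pack_cuda.py can later convert the same saved tensors into whatever layout a backend wants, without re-quantizing.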

This framework will later be promoted for reuse by Qwen3_5_moe and other desktop/laptop models.

Two production recipes: "default" (INT4 min_max + INT8 embedding) and
"sensitive" (INT8 for edge-layer v_proj/down_proj, INT4 HQQ elsewhere).

Sliding window attention uses a ring-buffer KV cache (2x window size)
for the 50 sliding layers, saving memory for long sequences. The 10
full-attention layers use a standard flat KV cache.
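
A minimal sketch of such a ring-buffer cache, assuming writes wrap modulo 2× the window; the real RingKVCache in model.py also needs matching attention masking and cache-position bookkeeping, omitted here:

```python
# Minimal sketch of a ring-buffer KV cache for sliding-window layers; names
# and shapes are assumptions, not the model.py implementation.
import torch

class RingKVCache:
    def __init__(self, window: int, n_heads: int, head_dim: int,
                 dtype=torch.bfloat16):
        # 2x the window so a full window of valid entries always survives
        # an incoming chunk without being overwritten mid-attention.
        self.capacity = 2 * window
        self.k = torch.zeros(1, n_heads, self.capacity, head_dim, dtype=dtype)
        self.v = torch.zeros_like(self.k)

    def update(self, pos: torch.Tensor, k_new: torch.Tensor,
               v_new: torch.Tensor):
        # pos holds absolute token positions; writes wrap modulo capacity,
        # so memory stays O(window) instead of O(max_seq_len).
        slots = pos % self.capacity
        self.k[:, :, slots] = k_new
        self.v[:, :, slots] = v_new
        return self.k, self.v
```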

Includes C++ runner (main.cpp), eager inference script, and 60+ unit
and integration tests across quant/ and pipeline test files.

Uses this model: https://huggingface.co/SocialLocalMobile/gemma-4-31B-it-HQQ-INT4

Copilot AI review requested due to automatic review settings April 29, 2026 21:06

pytorch-bot Bot commented Apr 29, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19213

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⏳ 1 Pending, 2 Unrelated Failures

As of commit 9108a5b with merge base d8da621:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Apr 29, 2026. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


Copilot AI left a comment


Pull request overview

Adds a new Gemma 4 31B-IT example pipeline for ExecuTorch (CUDA backend), including a packing-agnostic quantization format + recipes, CUDA packers, export/inference scripts, a C++ runner, and CI coverage.

Changes:

  • Introduces examples/models/gemma4_31b/quant/ with recipe → quantize → serialize → pack flow plus unit tests.
  • Adds Gemma 4 31B model implementation with hybrid attention and a sliding-window KV cache, plus export + eager inference entrypoints.
  • Adds CUDA runner build targets and runs Gemma 4 31B tests in the CUDA GitHub Actions workflow.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `examples/models/gemma4_31b/test_pipeline.py` | CPU-only integration tests for the quantize/save/load roundtrip and tiny checkpoint fixtures. |
| `examples/models/gemma4_31b/test_cuda_pipeline.py` | CUDA integration tests for pack/infer/export on a tiny model. |
| `examples/models/gemma4_31b/sampler.py` | GPU-side Gumbel-max sampler used by the exported model (sketched after this table). |
| `examples/models/gemma4_31b/quantize_and_save.py` | CLI to quantize HF checkpoints and write packing-agnostic safetensors bundles, plus the production recipes. |
| `examples/models/gemma4_31b/quant/test_serialize.py` | Unit tests for nibble packing and the safetensors serialization format. |
| `examples/models/gemma4_31b/quant/test_recipe.py` | Unit tests for regex/layer-filter recipe matching, plus production-recipe regression tests. |
| `examples/models/gemma4_31b/quant/test_quantize.py` | Unit tests for the quantize_weight and quantize_model APIs (CPU and CUDA/HQQ paths). |
| `examples/models/gemma4_31b/quant/test_pack_cuda.py` | CUDA unit tests for int4/int8 packers and load-and-pack dispatcher behavior. |
| `examples/models/gemma4_31b/quant/serialize.py` | Canonical quantized weight format and safetensors save/load with versioned metadata. |
| `examples/models/gemma4_31b/quant/recipe.py` | Declarative quantization recipe/rule objects with regex FQN matching and optional layer filters. |
| `examples/models/gemma4_31b/quant/quantize.py` | Implements min-max and HQQ quantization into canonical (packing-free) representations. |
| `examples/models/gemma4_31b/quant/pack_cuda.py` | CUDA-specific packers converting canonical weights into torchao runtime tensor subclasses. |
| `examples/models/gemma4_31b/quant/pack.py` | Backend-agnostic pack dispatcher that assigns weights/buffers and calls module-type packers. |
| `examples/models/gemma4_31b/quant/__init__.py` | Public API re-exports for the quant/ package. |
| `examples/models/gemma4_31b/quant/README.md` | Documentation of the quant framework, data flow, and backend extension points. |
| `examples/models/gemma4_31b/model.py` | Gemma 4 31B model definition, HF checkpoint loader, ring KV cache for sliding layers, runtime buffer materialization. |
| `examples/models/gemma4_31b/model.md` | Architecture/design notes for the model and quant pipeline. |
| `examples/models/gemma4_31b/main.cpp` | ExecuTorch CUDA runner driving exported prefill/decode and HF tokenizer decoding. |
| `examples/models/gemma4_31b/inference.py` | Eager CUDA inference script that loads prequantized weights, packs them, and generates text. |
| `examples/models/gemma4_31b/export.py` | Export and lowering pipeline (decode + prefill methods) targeting the CUDA backend. |
| `examples/models/gemma4_31b/__init__.py` | Package marker for the new model example. |
| `examples/models/gemma4_31b/README.md` | User-facing instructions for the quantize/export/inference/build/run workflows. |
| `examples/models/gemma4_31b/CMakePresets.json` | CMake preset for building the Gemma 4 31B CUDA runner. |
| `examples/models/gemma4_31b/CMakeLists.txt` | CMake build for the Gemma 4 31B runner, linking ExecuTorch, the CUDA backend, and the tokenizer. |
| `examples/models/gemma4/text_decoder/gemma4_norm.py` | Replaces the transformers RMSNorm dependency with a self-contained implementation. |
| `examples/models/gemma4/text_decoder/__init__.py` | Exposes attention/norm/MLP primitives used by gemma4_31b for shared numerically sensitive ops. |
| `Makefile` | Adds the gemma4_31b-cuda build target. |
| `.github/workflows/cuda.yml` | Adds Gemma 4 31B quant + pipeline tests to the CUDA CI job. |
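
One file worth unpacking is sampler.py: Gumbel-max sampling turns categorical sampling into a single argmax over noised logits, so token selection never leaves the GPU. A generic sketch of the technique, not the file's actual code:

```python
import torch

def gumbel_max_sample(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # argmax(logits/T + Gumbel noise) is an exact sample from
    # softmax(logits/T); no CPU round-trip, no torch.multinomial.
    if temperature == 0.0:
        return torch.argmax(logits, dim=-1)  # greedy decoding
    u = torch.rand_like(logits).clamp_(min=1e-10)
    gumbel = -torch.log(-torch.log(u))  # Gumbel(0, 1) noise
    return torch.argmax(logits / temperature + gumbel, dim=-1)
```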


Comment threads:
- examples/models/gemma4_31b/model.py
- examples/models/gemma4_31b/model.py
- examples/models/gemma4_31b/quant/pack.py (outdated)
- examples/models/gemma4_31b/test_pipeline.py (outdated)
- examples/models/gemma4_31b/quant/recipe.py (outdated)
- Sliding window layers use RingKVCache (2×window) instead of flat
  max_seq_len buffer, reducing KV cache memory for long sequences.
- Prefill is capped to the ring buffer size; the C++ runner chunks
  longer prompts automatically via get_max_prefill_chunk metadata (see
  the sketch after this list).
- Both recipes now quantize embed_tokens to INT8 per-axis (~1.4 GB
  savings vs bf16). The embedding packer uses IntxUnpackedToInt8Tensor,
  which supports gather.
- pack_model handles top-level FQNs (no parent module).
- C++ runner aligned with Qwen patterns: #ifdef guards for non-CUDA
  builds, better weight_sharing error handling, cudaDeviceSynchronize
  between prefill and decode.
- Test suite split into test_pipeline.py (CPU) and test_cuda_pipeline.py
  (CUDA) with shared fixtures. New chunked prefill correctness test.
- Prequantized checkpoint available at
  huggingface.co/SocialLocalMobile/gemma-4-31B-it-HQQ-INT4.
- Added Gemma 4 31B tests to cuda.yml CI workflow.
- Cleaned up stale terminology, docstrings, and comments throughout.
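
The chunked-prefill behavior from the list above, rendered as Python pseudocode; the runner itself is C++, and the method names here are illustrative:

```python
# Python rendering of the chunked-prefill loop the C++ runner performs.
# `module.run` is illustrative, not the real runner API; only the
# get_max_prefill_chunk metadata key comes from this PR.
def prefill_chunked(module, tokens: list[int], max_chunk: int) -> int:
    """Feed a prompt longer than the ring buffer in chunks no larger than
    get_max_prefill_chunk, returning the position of the next token."""
    pos = 0
    while pos < len(tokens):
        chunk = tokens[pos : pos + max_chunk]
        # Each chunk advances the KV caches; only the final chunk's logits
        # are needed to sample the first generated token.
        logits = module.run("prefill", chunk, start_pos=pos)
        pos += len(chunk)
    return pos
```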