# Gemma 4 on ExecuTorch

Multimodal inference for Gemma 4 on ExecuTorch: audio transcription, translation, image understanding, and text generation on mobile devices.

Variants: E2B (2B params) and E4B (4B params).
| 7 | + |
## Architecture

Single PTE file with up to four methods:
- `speech_transform` — waveform to log-mel spectrogram (no learned weights)
- `audio_encoder` — USM Conformer via HF's Gemma4AudioModel
- `vision_encoder` — ViT with 2D RoPE via HF's Gemma4VisionModel (8-bit, int8 position embeddings)
- `text_decoder` — autoregressive decoder with YOCO, PLE, and partial RoPE

Pass `--no-audio` or `--no-vision` at export time to exclude unused encoders.
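The `speech_transform` method is a plain signal-processing front end, which is why it carries no learned weights. A minimal numpy sketch of such a log-mel front end, assuming a 512-point FFT, 10 ms hop, and 128 mel bins (the exported method's exact parameters may differ):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            if c > l:
                fb[m - 1, k] = (k - l) / (c - l)
        for k in range(c, r):
            if r > c:
                fb[m - 1, k] = (r - k) / (r - c)
    return fb

def log_mel(waveform, sr=16000, n_fft=512, hop=160, n_mels=128):
    # Frame, window, FFT, mel-project, log-compress.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-6)

# One second of a 440 Hz tone -> (frames, mel bins)
t = np.arange(16000) / 16000.0
spec = log_mel(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)  # (97, 128)
```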
| 17 | + |
| | E2B | E4B |
|---|---|---|
| Hidden size | 1536 | 2560 |
| Layers | 35 | 42 |
| KV heads | 1 (MQA) | 2 |
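The single-head MQA (E2B) and 2-head (E4B) KV configurations above keep the KV cache small at the default 1024-token context. A back-of-envelope size estimate, assuming a 256-dim head and an fp16 cache (both assumptions; the exported model may use different values):

```python
def kv_cache_mb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**20

# Assumed head_dim=256, fp16 cache, 1024-token context -- illustrative only.
e2b = kv_cache_mb(layers=35, kv_heads=1, head_dim=256, seq_len=1024)
e4b = kv_cache_mb(layers=42, kv_heads=2, head_dim=256, seq_len=1024)
print(e2b, e4b)  # 35.0 84.0
```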
| 23 | + |
## Export

```bash
# E2B default (4-bit text, 8-bit vision, all modalities):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
  --checkpoint_path /tmp/gemma4-e2b-it

# E2B 4-bit with tied embedding (smaller, for on-device deployment):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
  --checkpoint_path /tmp/gemma4-e2b-it --tied_embedding

# E4B (4-bit):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
  --checkpoint_path /tmp/gemma4-e4b-it --variant e4b

# Audio-only (no vision encoder, saves ~129 MB):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
  --checkpoint_path /tmp/gemma4-e2b-it --no-vision

# Vision-only (no audio encoder, saves ~100 MB):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
  --checkpoint_path /tmp/gemma4-e2b-it --no-audio
```
| 47 | + |
## Model Variants

The default export includes all modalities (audio + vision + text). Default context length: 1024 tokens (`--max_seq_len`).
| 51 | + |
### Pre-exported Models

**E2B:**

| File | Size | Config | Description |
|------|------|--------|-------------|
| `gemma4.pte` | 4.1 GB | 4-bit, audio-only | Default — fastest |
| `gemma4_vision.pte` | 4.3 GB | 4-bit, all modalities | Audio + vision + text |
| `gemma4_tied_emb4.pte` | 2.5 GB | 4-bit tied + emb4, audio-only | Smallest |

**E4B:**

| File | Size | Config | Description |
|------|------|--------|-------------|
| `gemma4.pte` | 6.1 GB | 4-bit, audio-only | Default — fastest |
| `gemma4_vision.pte` | 6.2 GB | 4-bit, all modalities | Audio + vision + text |
| `gemma4_tied_emb4.pte` | 4.0 GB | 4-bit tied + emb4, audio-only | Smallest |
| 69 | + |
### Export Flags

| Variant | Size | Flags |
|---------|------|-------|
| E2B 4-bit (default) | 4.3 GB | (none) |
| E2B 4-bit audio-only | 4.1 GB | `--no-vision` |
| E2B 4-bit emb4 tied | 2.5 GB | `--quantize 8da4w+emb4 --tied_embedding --no-vision` |
| E4B 4-bit | 6.2 GB | `--variant e4b` |
| E4B 4-bit audio-only | 6.1 GB | `--variant e4b --no-vision` |
| E4B 4-bit emb4 tied | 4.0 GB | `--variant e4b --quantize 8da4w+emb4 --tied_embedding --no-vision` |

The vision encoder adds ~129 MB (8-bit linears + int8 position embedding table).

- **Untied models** (`gemma4.pte`, `gemma4_vision.pte`) work with both the Python and C++ runners.
- **emb4 tied** uses packed INT4 embeddings and shared embed_tokens/lm_head weights. Requires the C++ runner with TorchAO shared embedding kernels.
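Tying means one matrix serves as both the input embedding table and the output projection, which is why the tied variants are markedly smaller. A toy numpy sketch of the idea (vocabulary and hidden sizes are illustrative, not the model's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 8, 4

# One matrix plays both roles: embed_tokens and lm_head share it.
embed = rng.standard_normal((vocab, hidden))

token_ids = np.array([3, 1])
h = embed[token_ids]      # embedding lookup on the way in
logits = h @ embed.T      # lm_head reuses the same weights on the way out

print(logits.shape)  # (2, 8)
```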
| 85 | + |
## Build (CMake, host)

```bash
cmake --preset gemma4-cpu -S examples/models/gemma4
cmake --build --preset gemma4-cpu -j$(nproc)
```
| 92 | + |
## Run

```bash
# Audio transcription (C++ runner):
./cmake-out/examples/models/gemma4/gemma4_e2e_runner \
  --model_path gemma4.pte \
  --tokenizer_path tokenizer.model \
  --audio_path test_audio.wav

# Image understanding (C++ runner):
./cmake-out/examples/models/gemma4/gemma4_e2e_runner \
  --model_path gemma4.pte \
  --tokenizer_path tokenizer.model \
  --image_path photo.jpg \
  --prompt "Describe this image:"

# Text-only:
./cmake-out/examples/models/gemma4/gemma4_e2e_runner \
  --model_path gemma4.pte \
  --tokenizer_path tokenizer.model \
  --prompt "What is 2+2?"

# Python runner (audio):
buck2 run fbcode//executorch/examples/models/gemma4:run_gemma4 -- \
  --model_path /tmp/gemma4.pte \
  --tokenizer_path /tmp/tokenizer.model \
  --audio_path /tmp/test_audio.wav

# Python runner (image):
buck2 run fbcode//executorch/examples/models/gemma4:run_gemma4 -- \
  --model_path /tmp/gemma4.pte \
  --tokenizer_path /tmp/tokenizer.model \
  --image_path /tmp/photo.jpg \
  --prompt "Describe this image:"
```
| 128 | + |
## Input Requirements

**Audio**: WAV, 16 kHz, 16-bit PCM, mono, max 30 seconds.
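A quick way to check a file against these constraints before running the model, using only Python's stdlib `wave` module (a sketch; the runners do their own validation):

```python
import io
import wave

def check_wav(data: bytes, max_seconds=30):
    # Validate the constraints above: 16 kHz, mono, 16-bit PCM, <= 30 s.
    with wave.open(io.BytesIO(data)) as w:
        assert w.getframerate() == 16000, "must be 16 kHz"
        assert w.getnchannels() == 1, "must be mono"
        assert w.getsampwidth() == 2, "must be 16-bit PCM"
        duration = w.getnframes() / w.getframerate()
        assert duration <= max_seconds, "max 30 seconds"
        return duration

# Build one second of silence in memory to demonstrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
print(check_wav(buf.getvalue()))  # 1.0
```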

**Image**: JPEG or PNG. Images are resized to fit within `--max_vision_tokens` soft tokens (default 140), preserving aspect ratio, with dimensions rounded to multiples of 48 pixels. Fewer tokens is faster but captures less detail (25 ≈ 240x240, 70 ≈ 384x384, 140 ≈ 528x528, 280 ≈ 768x768).
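Under the (illustrative) assumption that each 48x48 patch costs one soft token — which matches the size/token pairs above — the resize logic can be sketched as:

```python
import math

def target_size(width, height, max_vision_tokens=140, patch=48):
    # Hypothetical sketch: scale down until the patch grid fits the token
    # budget, keep aspect ratio, snap dims down to multiples of 48.
    # The real exporter's resize policy may differ.
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    scale = 1.0
    if tokens > max_vision_tokens:
        scale = math.sqrt(max_vision_tokens / tokens)
    w = max(patch, int(width * scale / patch) * patch)
    h = max(patch, int(height * scale / patch) * patch)
    return w, h

print(target_size(528, 528))     # (528, 528) -- 121 tokens, fits 140
print(target_size(1920, 1080))   # (720, 384) -- 120 tokens
```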
| 134 | + |
## Samsung S25 Performance

### Audio (23 s clip)

| Model | Size | Load | Prefill | Gen | TTFT | RTF | Mem (load) | Mem (peak) |
|-------|------|------|---------|-----|------|-----|------------|------------|
| E2B gemma4.pte | 4.1 GB | 705 ms | 166 tok/s | 6 tok/s | 4.50 s | 0.71 | 1885 MB | 2251 MB |
| E2B gemma4_vision.pte | 4.3 GB | 648 ms | 163 tok/s | 6 tok/s | 4.56 s | 0.72 | 1890 MB | 2257 MB |
| E2B gemma4_tied_emb4.pte | 2.5 GB | 645 ms | 164 tok/s | 6 tok/s | 4.52 s | 0.71 | 1683 MB | 2241 MB |
| E4B gemma4.pte | 6.1 GB | 1.30 s | 91 tok/s | 4 tok/s | 7.50 s | 1.07 | 3231 MB | 3601 MB |
| E4B gemma4_vision.pte | 6.2 GB | 1.28 s | 92 tok/s | 4 tok/s | 7.47 s | 1.00 | 3231 MB | 3602 MB |
| E4B gemma4_tied_emb4.pte | 4.0 GB | 1.17 s | 85 tok/s | 4 tok/s | 8.00 s | 1.07 | 2899 MB | 3590 MB |
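RTF (real-time factor) is processing time divided by audio duration, so values below 1.0 transcribe faster than real time. Reading off the E2B gemma4.pte row, for example:

```python
# 23 s clip at RTF 0.71 -> total processing time in seconds.
audio_seconds = 23.0
rtf = 0.71
processing_seconds = rtf * audio_seconds
print(round(processing_seconds, 1))  # 16.3
```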
| 147 | + |
### Vision (dog.jpg, "Describe this image in two sentences.", 140 tokens ≈ 528x528)

| Model | Size | Load | Encode | Prefill | Gen | TTFT | Total | Mem (load) | Mem (peak) |
|-------|------|------|--------|---------|-----|------|-------|------------|------------|
| E2B gemma4_vision.pte | 4.3 GB | 798 ms | 2.73 s | 134 tok/s | 6 tok/s | 3.83 s | 10.14 s | 1884 MB | 2600 MB |
| E4B gemma4_vision.pte | 6.2 GB | 1.36 s | 2.44 s | 85 tok/s | 4 tok/s | 4.17 s | 14.62 s | 3232 MB | 3950 MB |
| 154 | + |
### Text ("Write a short paragraph about the history of artificial intelligence")

| Model | Size | Load | Prefill | Gen | TTFT | Total | Mem (load) | Mem (peak) |
|-------|------|------|---------|-----|------|-------|------------|------------|
| E2B gemma4.pte | 4.1 GB | 625 ms | 57 tok/s | 6 tok/s | 332 ms | 26.94 s | 1890 MB | 1950 MB |
| E4B gemma4.pte | 6.1 GB | 1.51 s | 38 tok/s | 3 tok/s | 506 ms | 44.66 s | 3231 MB | 3287 MB |