Commit 534569e

feat: add Gemma-4-26B-A4B local model support (llama.cpp)
Add Gemma-4 model documentation to Starlight docs and legacy docs:

- Model command with optimized settings (UD-Q4_K_XL, 128K context)
- Performance benchmarks (pp 395 tok/s, tg 40 tok/s on M1 Max)
- Vision setup with mmproj download instructions
- Quick reference table entries
- Thinking mode tip

Based on unsloth.ai llama.cpp guide for gemma-4-26B-A4B-it.
1 parent: 29ae733

2 files changed: 200 additions & 0 deletions


docs-site/src/content/docs/integrations/local-llms.mdx

Lines changed: 103 additions & 0 deletions
@@ -405,6 +405,54 @@ llama-server \
| UD-Q4\_K\_XL | ~18 GB | Good balance, recommended |
| Q8\_0 | ~32 GB | Higher quality, 20--40% slower |

### Gemma-4-26B-A4B -- Google MoE with Vision

A 26B MoE model from Google with only 4B active
parameters. Supports up to 256K context. Optionally
supports vision via a multimodal projector (mmproj).

```bash
llama-server \
  -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
  --port 8132 \
  -c 131072 \
  -b 2048 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64
```

**Key settings:**

| Setting | Why |
|---------|-----|
| `--temp 1.0` | Recommended by Google |
| `--top-k 64` | Gemma-specific sampling parameter |
| `-c 131072` | 128K context; Claude Code needs 20k+ for system prompt alone |
| `-fa on` | Flash attention for faster prompt processing |

**Performance (M1 Max 64 GB, ~37K input tokens):**

- Cold start: pp 395 tok/s, tg 40 tok/s (96s total)
- Cached follow-up: pp 110 tok/s, tg 40 tok/s (6s total)

| Quant | Size | Notes |
|-------|------|-------|
| UD-Q4\_K\_XL | ~16 GB | Recommended, fits comfortably on 64 GB systems |

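Once the server is up, a quick smoke test against its OpenAI-compatible API confirms the model loaded (these are standard llama-server routes; no API key is needed locally):

```bash
# List the loaded model -- confirms the server is up and the GGUF resolved
curl -s http://localhost:8132/v1/models

# Minimal chat round-trip
curl -s http://localhost:8132/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hi in one word."}], "max_tokens": 8}'
```
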
:::tip[Thinking Mode]
Enable thinking by prepending `<|think|>` to the
system prompt. The model outputs reasoning in
`<|channel>thought...<channel|>` tags before the
final answer. For multi-turn conversations, only
feed visible answers back -- exclude prior thought
blocks.
:::

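A concrete sketch of the tip over the chat completions API (the prompt text here is illustrative, not a required value):

```bash
# Prepend <|think|> to the system prompt to request reasoning output
curl -s http://localhost:8132/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "<|think|>You are a concise assistant."},
      {"role": "user", "content": "What is 17 * 24?"}
    ]
  }'
# The reply includes the thought tags before the final answer;
# drop them before appending the turn to conversation history.
```
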
## Quick Reference

| Model | Port | Command |

@@ -418,6 +466,7 @@ llama-server \
| GLM-4.7-Flash | 8129 | See full command above |
| Qwen3-Coder-Next | 8130 | See full command above (~46 GB) |
| Qwen3.5-35B-A3B | 8131 | See full command above (needs `--swa-full`) |
| Gemma-4-26B-A4B | 8132 | See full command above |

## Vision Models

@@ -482,6 +531,60 @@ plus a multimodal projector (mmproj).
</Steps>

### Gemma-4-26B-A4B Vision Setup

Gemma-4 also supports vision via a BF16 multimodal
projector.

<Steps>

1. **Download the mmproj file** (one-time):

   ```bash
   mkdir -p ~/models
   hf download \
     unsloth/gemma-4-26B-A4B-it-GGUF \
     mmproj-BF16.gguf \
     --local-dir ~/models
   ```

2. **Start the server** (port 8132):

   ```bash
   llama-server \
     -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
     --mmproj ~/models/mmproj-BF16.gguf \
     --port 8132 \
     -c 32768 \
     -b 2048 \
     -ub 1024 \
     --parallel 1 \
     -fa on \
     --jinja \
     --temp 1.0 \
     --top-p 0.95 \
     --top-k 64
   ```

3. **Add the provider** to `~/.codex/config.toml`:

   ```toml
   [model_providers.llama-8132]
   name = "Gemma-4 Vision"
   base_url = "http://localhost:8132/v1"
   wire_api = "chat"
   ```

4. **Run Codex with an image** (or test the endpoint directly with curl, as sketched after these steps):

   ```bash
   codex --model gemma-4 \
     -c model_provider=llama-8132 \
     -i screenshot.png "describe this"
   ```

</Steps>

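To sanity-check vision without going through Codex, the same endpoint accepts OpenAI-style image parts; a minimal sketch, assuming a local `screenshot.png`:

```bash
# Base64-encode the image and send it as a data URI image_url part
IMG=$(base64 < screenshot.png | tr -d '\n')
curl -s http://localhost:8132/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe this image."},
      {"type": "image_url",
       "image_url": {"url": "data:image/png;base64,${IMG}"}}
    ]
  }]
}
EOF
```
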
## Troubleshooting

### "failed to find a memory slot" errors

docs/local-llm-setup.md

Lines changed: 97 additions & 0 deletions
@@ -248,6 +248,50 @@ llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q8_0 \
| UD-Q4_K_XL | ~18 GB | Good balance, recommended |
| Q8_0 | ~32 GB | Higher quality, 20-40% slower |

### Gemma-4-26B-A4B (Google MoE with Vision)

A 26B MoE model from Google with only 4B active parameters. Supports up to 256K
context. Optionally supports vision via a multimodal projector (mmproj).

```bash
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
  --port 8132 \
  -c 131072 \
  -b 2048 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64
```

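Before pointing a client at it, you can check that the model has finished loading; llama-server's health endpoint returns 503 while loading and 200 once ready:

```bash
# Prints the HTTP status: 503 while loading, 200 when ready
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8132/health
```
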
**Key settings:**

| Setting | Why |
|---------|-----|
| `--temp 1.0` | Recommended by Google |
| `--top-k 64` | Gemma-specific sampling parameter |
| `-c 131072` | 128K context; Claude Code needs 20k+ for system prompt alone |
| `-fa on` | Enables flash attention for faster prompt processing |

**Performance (M1 Max 64 GB, ~37K input tokens):**

- Cold start: pp 395 tok/s, tg 40 tok/s (96s total)
- Cached follow-up: pp 110 tok/s, tg 40 tok/s (6s total)

**Quantization options:**

| Quant | Size | Notes |
|-------|------|-------|
| UD-Q4_K_XL | ~16 GB | Recommended, fits comfortably on 64GB systems |

> **Thinking mode:** Enable thinking by prepending `<|think|>` to the system
> prompt. The model outputs reasoning in `<|channel>thought...<channel|>` tags
> before the final answer. For multi-turn conversations, only feed visible
> answers back -- exclude prior thought blocks.

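A rough shell sketch of that filtering, assuming the tag spelling above is exactly what the model emits (`reply.txt` stands in for a captured model reply):

```bash
# Strip thought blocks so only the visible answer re-enters history.
# -0777 slurps the whole file so multi-line thoughts match; the pipes
# in the tags are escaped because | is a regex metacharacter.
perl -0777 -pe 's/<\|channel>thought.*?<channel\|>//gs' reply.txt
```
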
## Quick Reference

| Model | Port | Command |
@@ -259,6 +303,7 @@ llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q8_0 \
| Qwen3-Coder-30B | 8127 | `llama-server --fim-qwen-30b-default --port 8127` |
| Qwen3-Coder-Next | 8130 | See full command above (~46GB RAM) |
| GLM-4.7-Flash | 8129 | See full command above (requires chat template) |
| Gemma-4-26B-A4B | 8132 | See full command above |

## Usage

@@ -426,8 +471,60 @@ Then run Codex with an image:
codex --model qwen3-vl -c model_provider=llama-8128 -i screenshot.png "describe this"
```

## Gemma-4-26B-A4B Vision Setup

Gemma-4 also supports vision via a BF16 multimodal projector.

**One-time setup** (download the mmproj file):

```bash
just gemma4-download
# Or manually:
mkdir -p ~/models
hf download unsloth/gemma-4-26B-A4B-it-GGUF \
  mmproj-BF16.gguf \
  --local-dir ~/models
```

**Start the server** (port 8132):

```bash
just gemma4-vision
# Or manually:
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
  --mmproj ~/models/mmproj-BF16.gguf \
  --port 8132 \
  -c 131072 \
  -b 2048 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64
```

**Use with Codex:**

First, add the provider to `~/.codex/config.toml`:

```toml
[model_providers.llama-8132]
name = "Gemma-4 Vision"
base_url = "http://localhost:8132/v1"
wire_api = "chat"
```

Then run Codex with an image:

```bash
codex --model gemma-4 -c model_provider=llama-8132 -i screenshot.png "describe this"
```

## Quick Reference

| Model | Port | Command |
|-------|------|---------|
| Qwen3-VL-30B-A3B | 8128 | `just qwen3-vl` (after `just qwen3-vl-download`) |
| Gemma-4-26B-A4B | 8132 | `just gemma4-vision` (after `just gemma4-download`) |
