# Eval bug: Gemma 4 generates `<unused>` tokens in infinite loop #21516

## Bug Description

Gemma 4 (tested with `gemma-4-E2B-it-Q4_K_M.gguf`) generates an endless stream of a single special token on a **Vulkan build** of llama.cpp: `<unused8>` (token ID 14) with GPU offloading, and `[multimodal]` (token ID 5) when run CPU-only. No valid text is produced; generation runs until MaxTokens is exhausted.

This happens **despite** having all known Gemma 4 fixes applied:
- Tokenizer fix (#21343)
- Template parser fix (#21326)
- Custom newline split (#21406)
- Byte token handling (#21488)
- Logit softcapping (#21390)
- **Plus** the F32 MoE precision patch from PR #21506 (manually applied)

## Environment

- **OS:** Windows 11 Pro (10.0.26100)
- **GPU:** NVIDIA (24 GB VRAM), Vulkan backend
- **llama.cpp:** Built from `94ca829b6` (master, 2026-04-06) + PR #21506 patch
- **Model:** `gemma-4-E2B-it-Q4_K_M.gguf` from [unsloth/gemma-4-E2B-it-GGUF](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF)
- **Build:** CMake + MSVC + `GGML_VULKAN=ON`

## Steps to Reproduce

1. Build llama.cpp from master (`94ca829b6`) with Vulkan enabled (`GGML_VULKAN=ON`)
2. Load `gemma-4-E2B-it-Q4_K_M.gguf` with `-ngl 99`
3. Send any prompt (e.g., "Hello")
4. Observe: the model generates 18,000+ `<unused8>` tokens (id=14) without producing any readable text or emitting an EOG token

## Diagnostic Data

Token sampling output (first 10 tokens):
```
Token[0] id=14
Token[1] id=14
Token[2] id=14
Token[3] id=14
Token[4] id=14
Token[5] id=14
Token[6] id=14
Token[7] id=14
Token[8] id=14
Token[9] id=14
```

Token 14 in Gemma 4 vocab = `<unused8>`.

Generation stats: 183 tok/s, 18432 tokens generated, 44.7 seconds — no EOG token emitted.
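
For anyone trying to confirm the same failure from their own logs, the check can be automated. The following is a minimal sketch, assuming the log uses the `Token[i] id=N` line format shown above. The helper names and the `min_run=8` threshold are hypothetical choices for illustration and are not part of llama.cpp.

```python
import re
from itertools import groupby

# Matches lines such as "Token[3] id=14" from the sampling log format above.
TOKEN_LINE = re.compile(r"Token\[(\d+)\]\s+id=(\d+)")


def parse_token_ids(log_text):
    """Return the token IDs found in the log, in order of appearance."""
    return [int(m.group(2)) for m in TOKEN_LINE.finditer(log_text)]


def longest_run(token_ids):
    """Return (token_id, length) for the longest run of one repeated ID."""
    best = (None, 0)
    for tok, group in groupby(token_ids):
        n = sum(1 for _ in group)
        if n > best[1]:
            best = (tok, n)
    return best


def is_degenerate(token_ids, min_run=8):
    """Heuristic (assumed threshold): a long run of one ID counts as a loop."""
    return longest_run(token_ids)[1] >= min_run


# Rebuild the ten log lines from this report and check them.
log = "\n".join(f"Token[{i}] id=14" for i in range(10))
ids = parse_token_ids(log)
print(longest_run(ids), is_degenerate(ids))  # (14, 10) True
```

The same check could be used to stop generation early in a test harness instead of waiting for MaxTokens to be exhausted.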

## Additional Testing

- **CPU-only (`-ngl 0`):** Also broken, but with a different token: the model generates `[multimodal]` tokens (id=5) in an infinite loop instead of `<unused8>`.
- **PR #21506 applied (F32 MoE FFN precision):** No improvement on either CPU or Vulkan.
- **Ollama:** Same model works correctly in Ollama (which uses its own llama.cpp fork), producing valid responses.

## Init Logs (successful)

```
Model handle: OK
Vocab size: 262144
Layers: 35
Context size: 32768
Gemma tokens: start_of_turn=105, end_of_turn=106, bos=2
System tokens: 11
Initialization complete!
```

The model loads correctly, and the context, sampler, and batch all initialize, so the failure appears to lie in inference or sampling rather than in model loading.

## Related Issues

- #21321 — Gemma 4 generates `<unused24>` tokens (CUDA, partially fixed by #21506)
- #21343 — Gemma 4 tokenizer fix (merged, applied here)
- #21506 — F32 MoE precision (open, manually applied, does not fix this)

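The root cause may be a Vulkan-specific numerical precision issue beyond what #21506 addresses, or a different code path in the Vulkan compute shaders. Because the CPU-only run (`-ngl 0`) also fails, a cause outside the GPU shaders cannot be ruled out.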
