Bug Description
Gemma 4 models generate an infinite stream of <unused> tokens (Token ID 14 = <unused8>) on the Vulkan backend, both with GPU offloading and CPU-only. No valid text is produced; the model runs until MaxTokens is exhausted.
This happens despite having all known Gemma 4 fixes applied.
Environment
- 94ca829b6 (master, 2026-04-06) + PR #21506 patch (models : set gemma 4 FFN MoE prec to F32)
- gemma-4-E2B-it-Q4_K_M.gguf from unsloth/gemma-4-E2B-it-GGUF
- GGML_VULKAN=ON
Steps to Reproduce
1. Load gemma-4-E2B-it-Q4_K_M.gguf with -ngl 99.
2. Generation emits only <unused8> (token id=14), without producing any readable text or hitting EOG.
Diagnostic Data
Token sampling output (first 10 tokens):
Token 14 in Gemma 4 vocab = <unused8>.
Generation stats: 183 tok/s, 18432 tokens generated, 44.7 seconds; no EOG token emitted.
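For anyone trying to confirm the same symptom from their own token dump, the check can be automated. The following is a minimal sketch, not part of the original report: the function name `summarize_token_stream`, the `min_run` threshold, and the EOG id used in the usage lines are all hypothetical placeholders. It only assumes you can collect the sampled token ids as a list of integers.

```python
from typing import Dict, Iterable, Optional, Set, Union


def summarize_token_stream(
    token_ids: Iterable[int],
    eog_ids: Set[int],
    min_run: int = 32,  # hypothetical threshold for "stuck on one token"
) -> Dict[str, Union[int, bool, Optional[int]]]:
    """Report the longest run of one repeated token and whether EOG appeared."""
    ids = list(token_ids)

    longest_token: Optional[int] = None
    longest_run = 0
    run_token: Optional[int] = None
    run_len = 0

    # Single pass: track the current run and remember the longest one seen.
    for tok in ids:
        if tok == run_token:
            run_len += 1
        else:
            run_token, run_len = tok, 1
        if run_len > longest_run:
            longest_token, longest_run = run_token, run_len

    saw_eog = any(tok in eog_ids for tok in ids)

    return {
        "total": len(ids),
        "longest_run_token": longest_token,
        "longest_run": longest_run,
        "saw_eog": saw_eog,
        # Degenerate here means: one token repeated at length, never terminated.
        "degenerate": longest_run >= min_run and not saw_eog,
    }


if __name__ == "__main__":
    # Mirrors the report: 18432 tokens, all id 14. The EOG id below is a
    # placeholder, not taken from the Gemma 4 vocab.
    report = summarize_token_stream([14] * 18432, eog_ids={1})
    print(report)
```

Pass the real EOG token ids from the model's vocab in place of the placeholder; the check is only meaningful with the correct set.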
Additional Testing
- CPU-only (-ngl 0): Same result; generates [multimodal] tokens (id=5) in an infinite loop. Also broken.
Init Logs (successful)
Model loads correctly, context/sampler/batch all initialized — the issue is purely in inference/sampling.
Related Issues
- <unused24> tokens (CUDA, partially fixed by PR #21506, models : set gemma 4 FFN MoE prec to F32)
The root cause may be Vulkan-specific numerical precision issues beyond what #21506 addresses, or a different code path in the Vulkan compute shaders.
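To make the precision hypothesis concrete, here is a small illustration of how a half-precision accumulator can overflow where a single-precision one does not. It is a generic sketch of the failure class that the title of #21506 hints at, not a reproduction of the Vulkan shaders or a diagnosis of this bug. The helper names and input values are made up for the example.

```python
import struct
from typing import Sequence


def _round_trip(fmt: str, x: float) -> float:
    """Round x through an IEEE 754 format; overflow becomes +/-inf."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        # struct refuses finite values beyond the format's range, so map
        # them to infinity the way hardware rounding would.
        return float("inf") if x > 0 else float("-inf")


def to_f16(x: float) -> float:
    return _round_trip("<e", x)  # "e" is binary16 (half precision)


def to_f32(x: float) -> float:
    return _round_trip("<f", x)  # "f" is binary32 (single precision)


def dot_f16_accum(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product with inputs, products, and the running sum kept in F16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f16(acc + to_f16(to_f16(x) * to_f16(y)))
    return acc


def dot_f32_accum(a: Sequence[float], b: Sequence[float]) -> float:
    """Same dot product, but products and the running sum are kept in F32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(to_f16(x) * to_f16(y)))
    return acc


if __name__ == "__main__":
    a = [200.0] * 4
    b = [200.0] * 4
    # Each product (40000) fits in F16, but the running sum passes the F16
    # maximum of 65504 on the second term.
    print("f16 accumulate:", dot_f16_accum(a, b))
    print("f32 accumulate:", dot_f32_accum(a, b))
```

Once a value like this becomes infinite or NaN inside a layer, downstream logits can stop carrying useful information. That would be consistent with a sampler repeatedly picking the same token, though it does not prove that this is what happens here.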