### Name and Version
### Problem

When running Gemma4 26B (CPU+CUDA) with a long context, the model generates an endless stream of garbage tokens (e.g., `<unused>` tokens and other malformed output). The regression was introduced by commit c5ce4bc22.
### Environment

- Model: `google_gemma-4-26B-A4B-it-Q4_K_M.gguf`
- Backend: CPU + CUDA (RTX 3090, sm_86)
- OS: Windows 10
- CUDA: v13.1
- Build: Release, AVX2, CUDA graphs enabled (`GGML_CUDA_USE_GRAPHS`)
### Reproduction

Launch with:

```console
llama-server -m google_gemma-4-26B-A4B-it-Q4_K_M.gguf -fa on --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 -c 32768
```
Bisecting recent commits:

| Commit | Change | Result |
|---|---|---|
| edd4d9bca | vulkan FA dequant | ✅ OK |
| de1aa6fa7 | CUDA buffer overlap | ✅ OK |
| 66c4f9ded | CUDA mmq kernels | ✅ OK |
| 4eb19514d | kv-cache iSWA | ✅ OK |
| 5764d7c6a | gemma per-layer projections | ✅ OK |
| 3ba12fed0 | kv-cache FA quantization checks | ✅ OK |
| d9a12c82f | vocab: remove `</s>` eog for gemma4 | ✅ OK |
| 0d049d6a9 | unicode Qwen2 regex | ✅ OK |
| 69c28f154 | llama-server model params | ✅ OK |
| c5ce4bc22 | CUDA graphs props check | ❌ BREAKS |
| 660600081 | server ignore eos flag | ✅ OK (after reverting c5ce4bc22) |
### Operating systems
Windows
### GGML backends
CUDA
### Hardware
CPU + GPU (RTX 3090)
### Models
google_gemma-4-26B-A4B-it-Q4_K_M.gguf
### Problem description & steps to reproduce
Running `llama-server` on Gemma4 26B with a long context and a quantized KV cache (see the command in the Reproduction section above) produces an endless stream of garbage tokens. The behavior appears after commit c5ce4bc22; reverting that commit restores correct output.
### First Bad Commit
c5ce4bc22
### Relevant log output
<details>
<summary>Logs</summary>
```console