
Eval bug: Regression: CUDA graphs props check optimization causes infinite generation with Gemma4 26B on long context #21640

@andy921921

Description

Name and Version

Problem

When running Gemma4 26B (CPU+CUDA) with a long context, the model generates an endless stream of garbage tokens (e.g., `<unused>` tokens and other malformed output). The regression was introduced by commit c5ce4bc22.
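The symptom can be spotted mechanically. A minimal sketch (the sample string is illustrative only, not taken from the actual server output) that flags a completion dominated by `<unused>` tokens:

```shell
# Sketch: flag runaway "<unusedN>" generation in a captured completion.
# The sample text below is illustrative, not real llama-server output.
out='<unused12><unused12><unused12> garbage'

# count occurrences of <unusedN> tokens
count=$(printf '%s' "$out" | grep -o '<unused[0-9]*>' | wc -l | tr -d ' ')

if [ "$count" -gt 2 ]; then
  echo "runaway <unused> generation: $count tokens"
fi
```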

Environment

  • Model: google_gemma-4-26B-A4B-it-Q4_K_M.gguf
  • Backend: CPU + CUDA (RTX 3090, sm_86)
  • OS: Windows 10
  • CUDA: v13.1
  • Build: Release, AVX2, CUDA graphs enabled (GGML_CUDA_USE_GRAPHS)

Reproduction

Launch with:

```shell
llama-server -m google_gemma-4-26B-A4B-it-Q4_K_M.gguf -fa on --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 -c 32768
```
Bisection results:

| Commit | Change | Result |
|---|---|---|
| edd4d9bca | vulkan FA dequant | ✅ OK |
| de1aa6fa7 | CUDA buffer overlap | ✅ OK |
| 66c4f9ded | CUDA mmq kernels | ✅ OK |
| 4eb19514d | kv-cache iSWA | ✅ OK |
| 5764d7c6a | gemma per-layer projections | ✅ OK |
| 3ba12fed0 | kv-cache FA quantization checks | ✅ OK |
| d9a12c82f | vocab: remove `</s>` eog for gemma4 | ✅ OK |
| 0d049d6a9 | unicode Qwen2 regex | ✅ OK |
| 69c28f154 | llama-server model params | ✅ OK |
| c5ce4bc22 | CUDA graphs props check | ❌ BREAKS |
| 660600081 | server ignore eos flag | ✅ OK (after reverting c5ce4bc22) |

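As a runtime cross-check that the CUDA-graphs path itself is at fault (rather than something else in c5ce4bc22), ggml's CUDA backend can be told to skip graph capture via the `GGML_CUDA_DISABLE_GRAPHS` environment variable (assuming a build with GGML_CUDA_USE_GRAPHS, as here). If graphs are the culprit, the bad commit should then behave like the good ones:

```shell
# Re-run the failing command with CUDA graph capture disabled at runtime.
# Unix-style invocation shown; on Windows, set the variable with `set` first.
GGML_CUDA_DISABLE_GRAPHS=1 llama-server -m google_gemma-4-26B-A4B-it-Q4_K_M.gguf \
  -fa on --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 -c 32768
```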

### Operating systems

Windows

### GGML backends

CUDA

### Hardware

CPU + GPU

### Models

_No response_

### Problem description & steps to reproduce

See the problem description and reproduction steps above.

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>

```console
```

</details>