Name and Version
```
./llama-server --version
version: 39 (c08d28d)
built with GNU 10.2.1 for Linux x86_64
```
Operating systems
Linux
GGML backends
CUDA
Hardware
2x RTX 4090
Models
Google Gemma 4 31B-it UD-Q6_K_XL (SHA256: a8f8..2d06)
Problem description & steps to reproduce
Description
- For Gemma 4, generation stops when the model outputs `</s>`.
- According to the Gemma 4 `tokenizer.json`, `"</s>": 212` is not a special token. It is a normal text token that can appear as an HTML tag.
- However, llama.cpp treats `</s>` as an EOG token because of a hardcoded check in `src/llama-vocab.cpp`: `|| t.first == "</s>" // paddleocr` (see the sketch below).
- Related observation: according to the Gemma 4 `generation_config.json`, `<eos>` (token ID 1) is listed in `eos_token_id`, but it is not added to EOG.
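For context, a sketch of what that heuristic looks like, paraphrased from the quoted line and its surrounding loop; the exact list of string literals and the surrounding code differ between llama.cpp revisions, and the names `token_to_id` / `special_eog_ids` follow the style used there:

```cpp
// Sketch of the EOG heuristic in the vocab loader (paraphrased, not verbatim).
// Tokens are matched purely by their text, so a plain-text "</s>" vocab entry
// (as in Gemma 4) gets flagged as end-of-generation even though the tokenizer
// does not mark it as a control token.
for (const auto & t : token_to_id) {
    if (false
            || t.first == "<|im_end|>"
            || t.first == "<end_of_turn>"
            || t.first == "</s>" // paddleocr
       ) {
        special_eog_ids.insert(t.second);
    }
}
```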
Steps to Reproduce
- Run `llama-server` with a Gemma 4 model.
- Send a request asking the model to generate HTML containing the `<s>` tag:

```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": "Generate a simple HTML snippet that contains the tag <s>. Output only the HTML code."
}
]
}'
```

- Observe that the response is truncated when the model generates `</s>`, and `finish_reason` is `"stop"`.
Output
```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "",
"reasoning_content": "* Goal: Generate a simple HTML snippet.\n * Constraint 1: Must contain the `<s>` tag.\n * Constraint 2: Output *only* the HTML code.\n\n * The `<s>` tag represents text that is no longer correct, accurate, or relevant (strikethrough).\n * Example: `<s>Old price: $10"
}
}
],
"created": 1775386179,
"model": "Gemma-4-31B-it",
"system_fingerprint": "b39-c08d28d08",
"object": "chat.completion",
"usage": {
"completion_tokens": 87,
"prompt_tokens": 33,
"total_tokens": 120,
"prompt_tokens_details": { "cached_tokens": 0 }
},
"id": "chatcmpl-UgIRuWn37sS8AbWI8FLjtWOru2YlBaJo",
"timings": {
"cache_n": 0,
"prompt_n": 33,
"prompt_ms": 66.612,
"prompt_per_token_ms": 2.018545454545454,
"prompt_per_second": 495.4062331111512,
"predicted_n": 87,
"predicted_ms": 2759.828,
"predicted_per_token_ms": 31.72216091954023,
"predicted_per_second": 31.5237036510971
}
}
```
First Bad Commit
No response
Relevant log output
Startup logs
What first raised suspicion was the following `llama-server` startup log with Gemma 4:
```
load: 0 unused tokens
load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load: - 50 ('<|tool_response>')
load: - 106 ('<turn|>')
load: - 212 ('</s>')
load: special tokens cache size = 24
```
Temporary local fix for Gemma 4
As a temporary local fix, we replaced `|| t.first == "</s>"` with `|| t.first == "<eos>"` in `src/llama-vocab.cpp`.
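Expressed as a diff against the heuristic sketched above (placement within the string-match chain is approximate):

```diff
-            || t.first == "</s>" // paddleocr
+            || t.first == "<eos>"
```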
Startup logs with the local fix
```
load: 0 unused tokens
load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control-looking token: 1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load: - 1 ('<eos>')
load: - 50 ('<|tool_response>')
load: - 106 ('<turn|>')
load: special tokens cache size = 24
```
All Gemma 4 `eos_token_id` tokens are now present in the EOG list.
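To double-check the resulting EOG set without going through `llama-server`, something like the following sketch against the llama.cpp C API should work. API names assume a recent revision; older trees expose `llama_load_model_from_file` and `llama_token_is_eog(model, tok)` instead:

```cpp
#include "llama.h"
#include <cstdio>

// Minimal sketch: load a GGUF and print the EOG status of the token IDs
// discussed above (1 '<eos>', 106 '<turn|>', 212 '</s>').
int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    const llama_vocab * vocab = llama_model_get_vocab(model);

    for (int tok : {1, 106, 212}) {
        printf("token %5d '%s': eog=%d\n", tok,
               llama_vocab_get_text(vocab, tok),
               llama_vocab_is_eog(vocab, tok) ? 1 : 0);
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```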
Output after the local fix
The same request then returned the following output.
Output
```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "<s>This text is no longer accurate.</s>",
"reasoning_content": "* Task: Generate a simple HTML snippet.\n * Requirement: Must contain the `<s>` tag.\n * Constraint: Output *only* the HTML code.\n\n * The `<s>` tag represents text that is no longer accurate or relevant (strikethrough).\n * Example: `<s>Old Price: $50</s> New Price: $40`\n\n * `<s>This text is strikethrough.</s>`"
}
}
],
"created": 1775387279,
"model": "Gemma-4-31B-it",
"system_fingerprint": "b39-c08d28d08",
"object": "chat.completion",
"usage": {
"completion_tokens": 114,
"prompt_tokens": 33,
"total_tokens": 147,
"prompt_tokens_details": { "cached_tokens": 0 }
},
"id": "chatcmpl-guFL2BjLrPPorXi0zSGqIKMPVyr47Dbn",
"timings": {
"cache_n": 0,
"prompt_n": 33,
"prompt_ms": 120.469,
"prompt_per_token_ms": 3.6505757575757576,
"prompt_per_second": 273.92939262382856,
"predicted_n": 114,
"predicted_ms": 3625.823,
"predicted_per_token_ms": 31.8054649122807,
"predicted_per_second": 31.441137639647607
}
}
```