
Eval bug: Gemma 4: hardcoded </s> stops generation #21471

@annagorshunova


Name and Version

./llama-server --version
version: 39 (c08d28d)
built with GNU 10.2.1 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

2x RTX 4090

Models

Google Gemma 4 31B-it UD-Q6_K_XL (SHA256: a8f8..2d06)

Problem description & steps to reproduce

Description

  1. For Gemma 4, generation stops when the model outputs </s>.
  2. According to the Gemma 4 tokenizer.json, "</s>": 212 is not a special token. It is a normal text token that can appear as an HTML tag.
  3. However, llama.cpp treats </s> as an EOG (end-of-generation) token because of a hardcoded check in src/llama-vocab.cpp: || t.first == "</s>" // paddleocr.
  4. Related observation: according to the Gemma 4 generation_config.json, <eos> (token ID 1) is listed in eos_token_id, but it is not added to the EOG list.
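To make the mismatch concrete, here is a minimal Python sketch. It is not llama.cpp's actual code: the token IDs are taken from the startup logs below, and the name-based heuristic is reduced to just the string comparison at issue.

```python
# Sketch of the EOG-detection mismatch described above (illustrative only).
# Token IDs come from the llama-server startup logs in this report.
tokens = {"<eos>": 1, "<|tool_response>": 50, "<turn|>": 106, "</s>": 212}

# eos_token_id list from Gemma 4's generation_config.json (per this report)
config_eos_ids = {1, 106}

def is_hardcoded_eog(text: str) -> bool:
    # Simplified stand-in for the check in src/llama-vocab.cpp;
    # only the disputed comparison is modeled here.
    return text == "</s>"

eog_ids = {tid for text, tid in tokens.items() if is_hardcoded_eog(text)}

# `</s>` (a normal text token for Gemma 4) is wrongly treated as EOG...
assert 212 in eog_ids
# ...while `<eos>`, a configured eos_token_id, is not covered:
assert 1 not in eog_ids
assert not config_eos_ids <= eog_ids
```

The sketch omits the other EOG sources visible in the logs (tokens 50 and 106); it only isolates why </s> ends up in the EOG set while <eos> does not.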

Steps to Reproduce

  1. Run llama-server with a Gemma 4 model.
  2. Send a request asking the model to generate HTML containing the <s> tag:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Generate a simple HTML snippet that contains the tag <s>. Output only the HTML code."
      }
    ]
  }'
  3. Observe that the response is truncated when the model generates </s>, and finish_reason is "stop".
Output
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "reasoning_content": "*   Goal: Generate a simple HTML snippet.\n    *   Constraint 1: Must contain the `<s>` tag.\n    *   Constraint 2: Output *only* the HTML code.\n\n    *   The `<s>` tag represents text that is no longer correct, accurate, or relevant (strikethrough).\n    *   Example: `<s>Old price: $10"
            }
        }
    ],
    "created": 1775386179,
    "model": "Gemma-4-31B-it",
    "system_fingerprint": "b39-c08d28d08",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 87,
        "prompt_tokens": 33,
        "total_tokens": 120,
        "prompt_tokens_details": { "cached_tokens": 0 }
    },
    "id": "chatcmpl-UgIRuWn37sS8AbWI8FLjtWOru2YlBaJo",
    "timings": {
        "cache_n": 0,
        "prompt_n": 33,
        "prompt_ms": 66.612,
        "prompt_per_token_ms": 2.018545454545454,
        "prompt_per_second": 495.4062331111512,
        "predicted_n": 87,
        "predicted_ms": 2759.828,
        "predicted_per_token_ms": 31.72216091954023,
        "predicted_per_second": 31.5237036510971
    }
}

First Bad Commit

No response

Relevant log output

Startup logs

What first raised suspicion was the following llama-server startup log with Gemma 4:

load: 0 unused tokens
load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load:   - 50 ('<|tool_response>')
load:   - 106 ('<turn|>')
load:   - 212 ('</s>')
load: special tokens cache size = 24

Temporary local fix for Gemma 4

As a temporary local fix, we replaced: || t.first == "</s>" with || t.first == "<eos>" in src/llama-vocab.cpp.

Startup logs with the local fix

load: 0 unused tokens
load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 50 ('<|tool_response>')
load:   - 106 ('<turn|>')
load: special tokens cache size = 24

All Gemma 4 eos_token_id tokens are now present in the EOG list.
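The effect of the swap can be sketched the same way. This is again an illustration, not llama.cpp's real code: the heuristic below assumes the other two EOG entries from the log ("<|tool_response>" and "<turn|>") also come from name-based checks, which is an assumption made only to reproduce the logged EOG list.

```python
# Sketch: with "<eos>" substituted for "</s>" in the name-based check,
# the EOG set covers the configured eos_token_id values.
tokens = {"<eos>": 1, "<|tool_response>": 50, "<turn|>": 106, "</s>": 212}

config_eos_ids = {1, 106}  # Gemma 4 generation_config.json (per this report)

def is_hardcoded_eog(text: str) -> bool:
    # "<eos>" replaces "</s>"; the other names are assumed here only to
    # mirror the EOG list printed in the startup logs above.
    return text in ("<eos>", "<|tool_response>", "<turn|>")

eog_ids = {tid for text, tid in tokens.items() if is_hardcoded_eog(text)}

assert config_eos_ids <= eog_ids  # all configured EOS ids are now EOG
assert 212 not in eog_ids         # `</s>` no longer stops generation
```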

Output after the local fix

The same request then returned the following output.

Output
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "<s>This text is no longer accurate.</s>",
                "reasoning_content": "*   Task: Generate a simple HTML snippet.\n    *   Requirement: Must contain the `<s>` tag.\n    *   Constraint: Output *only* the HTML code.\n\n    *   The `<s>` tag represents text that is no longer accurate or relevant (strikethrough).\n    *   Example: `<s>Old Price: $50</s> New Price: $40`\n\n    *   `<s>This text is strikethrough.</s>`"
            }
        }
    ],
    "created": 1775387279,
    "model": "Gemma-4-31B-it",
    "system_fingerprint": "b39-c08d28d08",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 114,
        "prompt_tokens": 33,
        "total_tokens": 147,
        "prompt_tokens_details": { "cached_tokens": 0 }
    },
    "id": "chatcmpl-guFL2BjLrPPorXi0zSGqIKMPVyr47Dbn",
    "timings": {
        "cache_n": 0,
        "prompt_n": 33,
        "prompt_ms": 120.469,
        "prompt_per_token_ms": 3.6505757575757576,
        "prompt_per_second": 273.92939262382856,
        "predicted_n": 114,
        "predicted_ms": 3625.823,
        "predicted_per_token_ms": 31.8054649122807,
        "predicted_per_second": 31.441137639647607
    }
}
