Name and Version
```
./llama-server --version
version: 39 (c08d28d)
built with GNU 10.2.1 for Linux x86_64
```
Operating systems
Linux
GGML backends
CUDA
Hardware
2x RTX 4090
Models
Google Gemma 4 31B-it UD-Q6_K_XL (SHA256: a8f8..2d06)
Problem description & steps to reproduce
Description
- For Gemma 4, generation stops when the model outputs `</s>`.
- According to the Gemma 4 `tokenizer.json`, `"</s>": 212` is not a special token. It is a normal text token that can appear as an HTML tag.
- However, llama.cpp treats `</s>` as an EOG token because of a hardcoded check in `src/llama-vocab.cpp`: `|| t.first == "</s>" // paddleocr` (see the sketch below).
- Related observation: according to the Gemma 4 `generation_config.json`, `<eos>` (token ID 1) is listed in `eos_token_id`, but it is not added to EOG.
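For context, a sketch of what that heuristic looks like, paraphrased from the quoted line and its surrounding loop; the exact list of string literals and the surrounding code differ between llama.cpp revisions, and the names `token_to_id` / `special_eog_ids` follow the style used there:

```cpp
// Sketch of the EOG heuristic in the vocab loader (paraphrased, not verbatim).
// Tokens are matched purely by their text, so a plain-text "</s>" vocab entry
// (as in Gemma 4) gets flagged as end-of-generation even though the tokenizer
// does not mark it as a control token.
for (const auto & t : token_to_id) {
    if (false
            || t.first == "<|im_end|>"
            || t.first == "<end_of_turn>"
            || t.first == "</s>" // paddleocr
       ) {
        special_eog_ids.insert(t.second);
    }
}
```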
Steps to Reproduce
- Run `llama-server` with a Gemma 4 model.
- Send a request asking the model to generate HTML containing the `<s>` tag:

```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": "Generate a simple HTML snippet that contains the tag <s>. Output only the HTML code."
}
]
}'
```

- Observe that the response is truncated when the model generates `</s>`, and `finish_reason` is `"stop"`.
Output
```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "",
"reasoning_content": "* Goal: Generate a simple HTML snippet.\n * Constraint 1: Must contain the `<s>` tag.\n * Constraint 2: Output *only* the HTML code.\n\n * The `<s>` tag represents text that is no longer correct, accurate, or relevant (strikethrough).\n * Example: `<s>Old price: $10"
}
}
],
"created": 1775386179,
"model": "Gemma-4-31B-it",
"system_fingerprint": "b39-c08d28d08",
"object": "chat.completion",
"usage": {
"completion_tokens": 87,
"prompt_tokens": 33,
"total_tokens": 120,
"prompt_tokens_details": { "cached_tokens": 0 }
},
"id": "chatcmpl-UgIRuWn37sS8AbWI8FLjtWOru2YlBaJo",
"timings": {
"cache_n": 0,
"prompt_n": 33,
"prompt_ms": 66.612,
"prompt_per_token_ms": 2.018545454545454,
"prompt_per_second": 495.4062331111512,
"predicted_n": 87,
"predicted_ms": 2759.828,
"predicted_per_token_ms": 31.72216091954023,
"predicted_per_second": 31.5237036510971
}
}
```
First Bad Commit
No response
Relevant log output
Startup logs
What first raised suspicion was the following `llama-server` startup log with Gemma 4:
```
load: 0 unused tokens
load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load: - 50 ('<|tool_response>')
load: - 106 ('<turn|>')
load: - 212 ('</s>')
load: special tokens cache size = 24
```
Temporary local fix for Gemma 4
As a temporary local fix, we replaced `|| t.first == "</s>"` with `|| t.first == "<eos>"` in `src/llama-vocab.cpp`.
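Expressed as a diff against the heuristic sketched above (placement within the string-match chain is approximate):

```diff
-            || t.first == "</s>" // paddleocr
+            || t.first == "<eos>"
```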
Startup logs with the local fix
```
load: 0 unused tokens
load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control-looking token: 1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load: - 1 ('<eos>')
load: - 50 ('<|tool_response>')
load: - 106 ('<turn|>')
load: special tokens cache size = 24
```
All Gemma 4 `eos_token_id` tokens are now present in the EOG list.
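To double-check the resulting EOG set without going through `llama-server`, something like the following sketch against the llama.cpp C API should work. API names assume a recent revision; older trees expose `llama_load_model_from_file` and `llama_token_is_eog(model, tok)` instead:

```cpp
#include "llama.h"
#include <cstdio>

// Minimal sketch: load a GGUF and print the EOG status of the token IDs
// discussed above (1 '<eos>', 106 '<turn|>', 212 '</s>').
int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    const llama_vocab * vocab = llama_model_get_vocab(model);

    for (int tok : {1, 106, 212}) {
        printf("token %5d '%s': eog=%d\n", tok,
               llama_vocab_get_text(vocab, tok),
               llama_vocab_is_eog(vocab, tok) ? 1 : 0);
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```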
Output after the local fix
The same request then returned the following output.
Output
```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "<s>This text is no longer accurate.</s>",
"reasoning_content": "* Task: Generate a simple HTML snippet.\n * Requirement: Must contain the `<s>` tag.\n * Constraint: Output *only* the HTML code.\n\n * The `<s>` tag represents text that is no longer accurate or relevant (strikethrough).\n * Example: `<s>Old Price: $50</s> New Price: $40`\n\n * `<s>This text is strikethrough.</s>`"
}
}
],
"created": 1775387279,
"model": "Gemma-4-31B-it",
"system_fingerprint": "b39-c08d28d08",
"object": "chat.completion",
"usage": {
"completion_tokens": 114,
"prompt_tokens": 33,
"total_tokens": 147,
"prompt_tokens_details": { "cached_tokens": 0 }
},
"id": "chatcmpl-guFL2BjLrPPorXi0zSGqIKMPVyr47Dbn",
"timings": {
"cache_n": 0,
"prompt_n": 33,
"prompt_ms": 120.469,
"prompt_per_token_ms": 3.6505757575757576,
"prompt_per_second": 273.92939262382856,
"predicted_n": 114,
"predicted_ms": 3625.823,
"predicted_per_token_ms": 31.8054649122807,
"predicted_per_second": 31.441137639647607
}
}
```