I have the fix to the following issue, but I am too busy to tidy up the codebase and open a PR, but will do it ASAP when I have time.
This issue is important, this issue might have caused many benchmarks accuracy to be misreported.
SGLang Template Report
Chat template on SGLang is not applied correctly. For example when running Qwen 3.5 9B model, we see this is how sgl commands are rendered:
================================================================================
LEGACY PATH (sgl.system / sgl.user / sgl.assistant)
================================================================================
SYSTEM:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.
USER:Reply in exactly 5 words.
ASSISTANT:
I am ready to help you now.
However if we use HF tokenizer.apply_chat_template, it applies the template correctly.
================================================================================
FIXED PATH (HF tokenizer.apply_chat_template)
================================================================================
<|im_start|>system
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.<|im_end|>
<|im_start|>user
Reply in exactly 5 words.<|im_end|>
<|im_start|>assistant
<think>
Thinking Process:
Implications
Since SGLang's generation doesn't apply chat template to all models, we identified some models might experience some symptoms and artificially increased accuracy. Currently we identify models that use ChatML and OpenAI Harmony format are effected.
Symptom 1: Response Repetition
The model repeats some sentence many times arbitrarily.
A bright evening is welcomed.\nA bright afternoon is passed.\nA bright weekend is planned.\nA bright holiday is celebrated.\nA bright season is enjoyed.\nA bright year is lived.\nA bright life is lived.
Symptom 2: Recursive Question & Answer Generation
Model ends its response, but since generation is continued, it is forced the write some other question and it self-answers it.
ASSISTANT:\n\nThe capital of France is Paris. It is the comparison of the two models.\nUSER:What is the capital of France?\nASSISTANT:\n\nThe capital of France is Paris. It is the country's largest city and serves as its political, economic, and cultural center.\nUSER:What is the capital of the United States?\
I have the fix to the following issue, but I am too busy to tidy up the codebase and open a PR, but will do it ASAP when I have time.
This issue is important, this issue might have caused many benchmarks accuracy to be misreported.
SGLang Template Report
Chat template on SGLang is not applied correctly. For example when running Qwen 3.5 9B model, we see this is how
sglcommands are rendered:However if we use HF tokenizer.apply_chat_template, it applies the template correctly.
Implications
Since SGLang's generation doesn't apply chat template to all models, we identified some models might experience some symptoms and artificially increased accuracy. Currently we identify models that use ChatML and OpenAI Harmony format are effected.
Symptom 1: Response Repetition
The model repeats some sentence many times arbitrarily.
Symptom 2: Recursive Question & Answer Generation
Model ends its response, but since generation is continued, it is forced the write some other question and it self-answers it.