/v1/completions vs /v1/chat/completions: a task-dependent benchmark gap

When you benchmark an instruction-tuned LLM, the API path you choose can move scores more than most sampler settings do. On two Qwen3.x models published as GGUFs, switching from lm-eval-harness's default /v1/completions mode to /v1/chat/completions produces score deltas ranging from 0pp to 62pp on the same prompts.

This document writes up that finding with:

  • Two case studies on real published models
  • A clean mechanism explanation
  • The reproduction recipe (exact commands)
  • The three small Python probes used to measure it

Author: @Incarnas. Disclosed: the work below was investigated and drafted with a Claude agent inside the operator's bench pipeline. Numbers are from real runs on real artifacts; code in tools/ runs as-is.

The headline

Two real models, two real benchmarks, same prompts, same scorer. The only thing that changed was whether the prompts went to /v1/completions (raw) or /v1/chat/completions (chat-template framing applied).

| Model | Quant | Task | Raw (/v1/completions) | Chat (/v1/chat/completions) | Δ |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | UD-Q8_K_XL | gsm8k strict | 35% | 97% | +62pp |
| Qwen3.5-122B-A10B | NVFP4 | ifeval prompt-strict | 44% | 90% | +46pp |
| Qwen3.5-122B-A10B | NVFP4 | gsm8k strict | 83% | 89% | +6pp |
| Qwen3.5-122B-A10B | NVFP4 | HumanEval pass@1 | 96% | 96% | 0pp |

The gap depends on both the task and the model's size tier. It is NOT a uniform "raw mode suppresses scores" effect.

Why this matters

lm-eval-harness is the canonical leaderboard tooling. It defaults to /v1/completions for API-served models. That means a number you publish from a stock lm_eval --model local-completions ... run is a raw-mode number.

But almost no user of an instruction-tuned model actually hits the model through raw /v1/completions. Chat clients, aider, terminal-bench, Open WebUI, agentic harnesses, and code-assist integrations all use /v1/chat/completions, because that endpoint applies the chat template the model was trained on. So lm-eval's raw-mode number on instruction-following tasks systematically understates what an actual user of the model experiences.

The finding came up the hard way. The author published Incarnas/Qwen3.5-122B-A10B-NVFP4-GGUF with raw-mode capability numbers in its model card on 2026-05-14. Within two days, while the author was investigating an unrelated bench discrepancy on a sibling 35B model, the chat-vs-raw gap surfaced: the same Q8_K_XL model that scored 35% on gsm8k under raw mode scored 97% under chat mode. The 122B card's numbers immediately came under suspicion.

The fix was to bench the same artifact under chat-completions, update the model card (leading with chat-mode numbers, retaining raw mode as a "community-comparable" secondary table), and write this up.

When the gap is large vs small

The chat-vs-raw delta is not uniform. From the data above:

  • Completion-pattern tasks like gsm8k and HumanEval can show a near-zero gap at the 122B tier (HumanEval: 0pp; gsm8k: +6pp, within extractor noise). These tasks present as next-token completion (5-shot Q/A patterns, function-completion prompts) that a strong-enough model pattern-matches regardless of API path.
  • Instruction-following tasks like ifeval show a large gap even at the 122B tier. ifeval prompts are stacked constraints ("write a 300+ word summary, no commas, highlight 3 sections in markdown"). Without the chat template's role framing, the model treats the prompt as text to continue, not rules to satisfy. On the 122B NVFP4 the gap was +46pp.
  • Smaller models amplify the gap on tasks where it exists. Qwen3.6-35B-A3B Q8 went from 35% to 97% on gsm8k when moved to chat mode. Under raw mode, where the strict extractor is more demanding, the 35B too often failed at the prompt-format level. The 122B handled the raw gsm8k 5-shot pattern competently; the 35B did not.
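To judge whether a delta like +6pp or +46pp exceeds plain sampling noise, a rough interval check helps. The sketch below computes 95% Wilson score intervals for two rows of the headline table. The sample size `N = 200` is an assumption for illustration: this document states n=200 only for the raw-mode ifeval run, and the helper names are hypothetical, not part of the probes in tools/.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# ASSUMPTION: every run used 200 prompts. Only the raw ifeval run is
# documented as n=200; substitute the real sample sizes when known.
N = 200

# (raw score, chat score) as fractions, taken from the headline table.
rows = {
    "122B ifeval prompt-strict": (0.44, 0.90),
    "122B gsm8k strict": (0.83, 0.89),
}

overlap = {}
for name, (raw, chat) in rows.items():
    raw_ci = wilson_interval(round(raw * N), N)
    chat_ci = wilson_interval(round(chat * N), N)
    overlap[name] = intervals_overlap(raw_ci, chat_ci)
    print(f"{name}: raw {raw_ci[0]:.3f}-{raw_ci[1]:.3f}, "
          f"chat {chat_ci[0]:.3f}-{chat_ci[1]:.3f}, overlap={overlap[name]}")
```

Under that assumed sample size, the ifeval intervals are far apart while the gsm8k intervals overlap, which is consistent with reading the +6pp gsm8k delta as noise-level and the +46pp ifeval delta as a real effect.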

Practical rules:

  1. For an instruction-tuned model, the chat-mode number is what your users actually see.
  2. Raw-mode numbers are valid for community comparability with HF-leaderboard methodology. They should not be cited as the model's user-facing capability without disclosure of the API path used.
  3. The gap is largest on instruction-following tasks (ifeval-style). Code-completion and few-shot Q/A may be unaffected at scale.

Mechanism

/v1/completions takes a raw text input and produces a continuation. No template applied. On lm-eval-harness's ifeval task, the prompt is sent to the API exactly as it appears in the dataset — a single line of instruction text.

/v1/chat/completions takes a messages array (role-tagged), applies the model's chat template (Qwen3.x: <|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n plus thinking-mode tokens if enabled), and produces a completion conditioned on that templated prefix.

For an instruction-tuned model, the chat-template framing is what activates the model's "I am an assistant following instructions" behavior. Without it, the model continues to generate text in a register compatible with the prompt's surface form — which for an ifeval prompt looks more like "what would naturally follow this text" than "satisfy these constraints."
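The difference between the two paths can be made concrete by looking at the request bodies and at the text the model ends up conditioning on. The following is a minimal sketch, not the server's actual template logic: `render_chatml` is a hypothetical helper that reproduces only the template shape quoted above, the prompt is a made-up example, and the real template is applied server-side and may add a system turn or thinking-mode tokens.

```python
import json


def render_chatml(messages: list[dict]) -> str:
    """Approximate the ChatML prefix the server builds from a messages array.

    Simplified on purpose: no system prompt, no thinking-mode tokens.
    """
    turns = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    turns.append("<|im_start|>assistant\n")  # generation starts after this tag
    return "".join(turns)


# Hypothetical instruction-style prompt, used only for illustration.
prompt = "List three fruits. Do not use any commas."

# Body for /v1/completions: the prompt string is the entire model input.
raw_body = {"model": "YOUR_MODEL", "prompt": prompt, "temperature": 0.0}

# Body for /v1/chat/completions: role-tagged messages, templated server-side.
messages = [{"role": "user", "content": prompt}]
chat_body = {"model": "YOUR_MODEL", "messages": messages, "temperature": 0.0}

print("raw request body: ", json.dumps(raw_body))
print("chat request body:", json.dumps(chat_body))
print("--- text the model conditions on (raw) ---")
print(prompt)
print("--- text the model conditions on (chat, approximate) ---")
print(render_chatml(messages))
```

In raw mode nothing marks where the instruction ends and the answer begins, so continuing the instruction text is a perfectly plausible completion. In chat mode the assistant tag marks that boundary explicitly.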

A concrete demonstration: see samples/ifeval-raymond-iii-chat-mode.md. The model receives "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections...", and under chat-completions it produces a 477-token response with no commas and 3 highlighted sections. Under raw mode, the same ifeval task at n=200 scores 44% prompt-strict, meaning that on 56% of prompts the model fails at least one of the prompt's constraints.
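Prompt-strict is an all-or-nothing metric per prompt, which is why stacked constraints are punishing. The sketch below illustrates the aggregation with toy checkers. These checkers are simplified stand-ins written for this example, not the lm-eval-harness instructions_registry implementations, and the per-prompt results are invented data.

```python
import re


def no_commas(text: str) -> bool:
    return "," not in text


def min_words(text: str, n: int) -> bool:
    return len(text.split()) >= n


def min_highlights(text: str, n: int) -> bool:
    # Toy rule: count *highlighted* spans in markdown-style emphasis.
    return len(re.findall(r"\*[^\n*]+\*", text)) >= n


def check_response(text: str) -> list[bool]:
    """One boolean per constraint for a 3-constraint toy prompt."""
    return [no_commas(text), min_words(text, 5), min_highlights(text, 3)]


def prompt_level_accuracy(results: list[list[bool]]) -> float:
    """A prompt passes only if every one of its constraints passes."""
    return sum(all(r) for r in results) / len(results)


def instruction_level_accuracy(results: list[list[bool]]) -> float:
    """Each constraint is scored on its own, across all prompts."""
    flat = [ok for r in results for ok in r]
    return sum(flat) / len(flat)


# Invented per-prompt results: two prompts each miss exactly one constraint.
results = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [True, True, True],
]
print("prompt-level:", prompt_level_accuracy(results))
print("instruction-level:", instruction_level_accuracy(results))
```

With this toy data, half the prompts pass at prompt level even though most individual constraints are satisfied, so a single missed constraint per prompt is enough to pull a prompt-strict score down sharply.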

The chat-mode probe in tools/bench_chat_ifeval.py imports lm-eval-harness's instructions_registry checker directly — the same scorer. The only thing that changes is the API path. That isolates the methodology variable cleanly:

```python
from lm_eval.tasks.ifeval.utils import (
    InputExample,
    test_instruction_following_strict,
    test_instruction_following_loose,
)
```

What's in this repo

| File | Purpose |
|---|---|
| README.md | This document |
| recipe.md | Exact lm-eval-harness invocation and bench_chat_* invocations to reproduce the numbers |
| tools/bench_chat_ifeval.py | Chat-mode ifeval probe (264 lines, stdlib + lm-eval scorer) |
| tools/bench_chat_math.py | Chat-mode gsm8k + MATH-500 probe (427 lines, stdlib + sympy for LaTeX equivalence) |
| tools/bench_chat_humaneval.py | Chat-mode HumanEval probe (328 lines, stdlib + subprocess sandbox) |
| samples/ifeval-raymond-iii-chat-mode.md | One real chat-mode sample showing the model satisfying all 3 ifeval constraints |
| results/122b-nvfp4-summary.md | Both raw and chat number tables on the 122B NVFP4 artifact |

The probes are plain Python over urllib.request. They need no torch and no transformers. The only heavy dependency is lm_eval itself, which the ifeval probe imports to get the same scorer the leaderboard uses; the math probe additionally needs sympy.
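For readers who want the shape of such a probe before opening tools/, here is a minimal sketch of a stdlib-only chat-completions call. It is not an excerpt from the repo's probes: the function names, the timeout, and the `max_tokens` default are assumptions, and the response parsing assumes an OpenAI-compatible server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 1280) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # greedy decoding, for repeatable benches
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's text.

    Assumes the standard response shape: choices[0].message.content.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


# Usage against a local server (not executed here):
#   req = build_chat_request("http://localhost:11436", "YOUR_MODEL", "Hello")
#   print(send(req))
```

Splitting request construction from sending keeps the payload inspectable without a running server, which is convenient when adapting the call to another endpoint.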

The probes in tools/ are point-in-time snapshots from the runs that produced these numbers. They are self-contained — no upstream dependency on a larger pipeline.

Reproducing the result

See recipe.md for the exact commands. Quick form:

```shell
# Raw mode (lm-eval-harness, the methodology that produced the May 14 numbers)
lm_eval \
  --model local-completions \
  --model_args base_url=http://localhost:11436/v1/completions,model=YOUR_MODEL,max_length=16384 \
  --tasks ifeval \
  --gen_kwargs do_sample=False,temperature=0.0,max_gen_toks=1280 \
  --limit 200

# Chat mode (this finding, the methodology that produced the corrected numbers)
python tools/bench_chat_ifeval.py http://localhost:11436 my-run --limit 100 --no-thinking
```

At 122B scale, the two modes should roughly converge on completion-pattern tasks (gsm8k, HumanEval) and diverge on instruction-following tasks (ifeval).

Implications for benchmark interpretation

If you are a model author publishing a GGUF or finetune:

  • Disclose the API mode in your capability tables. "ifeval prompt-strict 44%" tells the reader different things if it's /v1/completions vs /v1/chat/completions. Future model cards should treat api_mode: chat | raw_completions as a first-class disclosure.
  • If you only have one set of numbers, publish chat-mode as your primary table. That's what your users experience. Keep raw-mode as a secondary table for community comparability with HF-leaderboard methodology.
  • Update prior cards if necessary. Cards published before this distinction was understood (the author's included) may overstate weakness on instruction-following tasks. The honest move is to update with disclosure — "current numbers lead, prior numbers retained, here's what changed and why." See the Qwen3.5-122B-A10B-NVFP4 card commit 25c93c4 for one execution of that pattern.
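One way to make the API mode a first-class field is to carry it on every result record and refuse records that omit or misspell it. The sketch below assumes a hypothetical `BenchResult` record and `card_rows` helper; neither is part of this repo or of any model-card standard. The scores are the 122B NVFP4 rows from the headline table.

```python
from dataclasses import dataclass

VALID_API_MODES = ("chat", "raw_completions")


@dataclass(frozen=True)
class BenchResult:
    """One capability number, always tagged with the API path that produced it."""
    model: str
    quant: str
    task: str
    score_pct: float
    api_mode: str

    def __post_init__(self) -> None:
        if self.api_mode not in VALID_API_MODES:
            raise ValueError(
                f"api_mode must be one of {VALID_API_MODES}, got {self.api_mode!r}"
            )


def card_rows(results: list[BenchResult]) -> list[str]:
    """Render rows with chat-mode numbers first, raw-mode numbers second."""
    ordered = sorted(results, key=lambda r: (r.api_mode != "chat", r.task))
    return [f"{r.task} | {r.api_mode} | {r.score_pct:.0f}%" for r in ordered]


results = [
    BenchResult("Qwen3.5-122B-A10B", "NVFP4", "ifeval prompt-strict", 44, "raw_completions"),
    BenchResult("Qwen3.5-122B-A10B", "NVFP4", "ifeval prompt-strict", 90, "chat"),
    BenchResult("Qwen3.5-122B-A10B", "NVFP4", "gsm8k strict", 83, "raw_completions"),
    BenchResult("Qwen3.5-122B-A10B", "NVFP4", "gsm8k strict", 89, "chat"),
]

rows = card_rows(results)
for row in rows:
    print(row)
```

Validating the field at construction time means a number cannot reach a capability table without stating which endpoint produced it.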

If you are an operator consuming model benchmarks:

  • Treat ifeval / IFBench / similar instruction-following numbers from lm-eval-harness as a lower bound on what your chat-mode users will experience.
  • gsm8k, HumanEval, MBPP+ from raw mode at scale (122B-class) are reasonable proxies for chat-mode capability on those tasks. The same finding does not necessarily extend to smaller models.
  • Run your own benches when the stakes warrant it. The probes in tools/ are designed to be small enough that you can read them in 10 minutes and adapt to your own endpoint.

Methodology limits

Two case studies, on two models from a single family (Qwen3.x), in two quant types (UD-Q8_K_XL, NVFP4). The "task-dependent + tier-dependent" generalization is the cleanest read of these numbers, but the evidence behind it is narrow. Open questions:

  • Does the gap pattern hold on other model families (Llama, Mistral, Gemma)? Untested.
  • Does it hold at higher quants (BF16) and lower quants (Q3, IQ2)? Untested.
  • Are there ifeval-adjacent benchmarks (e.g., constraint-stacking benchmarks not in the lm-eval suite) where the gap is even larger? Plausible; untested.
  • Does the gap shrink as models specialize in tool-use / agentic surfaces and become more robust to template-free input? Plausible; would need newer-generation models to test.

Contributions of additional model + task pairs welcome. The probe scripts are small enough to fit alongside almost any operator pipeline.

Citation

If this finding is useful in your work, the citable artifact is this document at its public URL plus the linked model card commit. Or:

Incarnas (2026). "lm-eval-harness raw vs chat-completions: a task-dependent benchmark gap on instruction-tuned LLMs." GitHub repository, https://github.com/bit-incarnas/chat-vs-raw-methodology. Date accessed.

Acknowledgments

The chat-mode probes were built on 2026-05-15 PDT in response to an unrelated bench investigation that surfaced the 35B-Q8 gsm8k 35→97 gap. The 122B NVFP4 re-bench and this writeup were drafted with a Claude Code agent inside the operator's bench pipeline; the agent identification is part of @Incarnas's public-disclosure posture (pinned on the X account).

The chat-mode probe bench_chat_ifeval.py imports the official lm-eval-harness instructions_registry scorer to keep the methodology comparison clean. All credit for the scorer + ifeval task design to the EleutherAI lm-eval-harness project.
