Summary
AIME 2025 evaluation uses inspect_evals.aime2025, whose scorer wraps inspect_ai.scorer.match(numeric=True) with default location="end". After extracting the last numeric token from the model completion, inspect_ai compares it to the gold label with str.endswith, not numeric equality.
That produces false positives whenever a wrong answer ends with the digits of the correct label (for example, 711 against a gold label of 11).
Concrete example
| Model output | Gold | match(numeric=True) | Should be |
| --- | --- | --- | --- |
| ANSWER: 711 | 11 | Correct | Incorrect |
| ANSWER: 149 | 49 | Correct | Incorrect |
On the math-ai/aime25 test set (30 problems), labels 49 and 149 are both present; a prediction of 149 for a problem with answer 49 is scored correct.
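To gauge how many gold labels are exposed to each other in this way, one could scan the label set for pairs where one label is a proper digit suffix of another. This is a minimal sketch, assuming the labels are available as strings; the helper name `suffix_collisions` and the sample list are illustrative and are not taken from the dataset loader.

```python
from itertools import permutations


def suffix_collisions(labels):
    """Return (wrong, gold) pairs where `wrong` ends with `gold` but differs from it."""
    # Deduplicate and order by length so the output is deterministic.
    unique = sorted(set(labels), key=lambda s: (len(s), s))
    return [
        (wrong, gold)
        for wrong, gold in permutations(unique, 2)
        if wrong.endswith(gold)
    ]


# Illustrative labels only: 49 and 149 are the pair named above, 70 is a filler.
sample_labels = ["49", "149", "70"]
print(suffix_collisions(sample_labels))  # [('149', '49')]
```

This only finds collisions between gold labels. Any wrong prediction that ends with the gold digits triggers the same false positive, whether or not it is itself a label.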
Reproduction (inspect-ai 0.3.x):
    from inspect_ai.scorer._common import match_str

    _, ok = match_str("ANSWER: 711", "11", location="end", numeric=True)
    assert ok  # passes, which is the bug: 711 is graded as matching 11
Root cause
In inspect_ai/scorer/_common.py, the numeric branch extracts the last number, then falls through to:
    elif location == "end":
        return answer, v.endswith(t)
location="exact" is not a viable workaround (it compares the entire chain-of-thought string to the target).
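The difference between the two comparisons can be seen without inspect_ai installed. The sketch below is a simplified stand-in for the comparison step only, not the library's code; `suffix_compare` and `numeric_compare` are hypothetical names, and both inputs are assumed to be integer strings that have already been extracted.

```python
def suffix_compare(extracted: str, target: str) -> bool:
    # Same kind of test as the location="end" branch quoted above: a string suffix check.
    return extracted.endswith(target)


def numeric_compare(extracted: str, target: str) -> bool:
    # Compares values instead of characters; assumes both are integer strings.
    return int(extracted) == int(target)


for extracted, target in [("711", "11"), ("149", "49"), ("49", "49")]:
    print(extracted, target, suffix_compare(extracted, target), numeric_compare(extracted, target))
# 711 11 True False
# 149 49 True False
# 49 49 True True
```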
Impact
- Inflated AIME accuracy on PostTrainBench runs
- Run-to-run score noise when combined with sampling (separate harness concern)
- Same pattern affects other inspect evals using match(numeric=True) (e.g. GSM8K)
Upstream
The primary fix belongs in inspect_ai (use == on normalized numbers after extraction). See UKGovernmentBEIS/inspect_ai.
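One possible shape for "== on normalized numbers" is sketched below. What counts as normalized is an assumption here (thousands separators stripped, leading zeros and a trailing `.0` ignored), and `normalize_number` and `numbers_equal` are hypothetical helpers rather than existing inspect_ai functions.

```python
from decimal import Decimal, InvalidOperation


def normalize_number(text: str):
    """Parse a numeric string into a Decimal, or return None if it is not a finite number."""
    cleaned = text.strip().replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject NaN and infinities so they can never compare equal to a label.
    return value if value.is_finite() else None


def numbers_equal(extracted: str, target: str) -> bool:
    """True only when both strings parse and denote the same value."""
    a, b = normalize_number(extracted), normalize_number(target)
    return a is not None and b is not None and a == b


print(numbers_equal("711", "11"))   # False
print(numbers_equal("049", "49"))   # True
print(numbers_equal("49.0", "49"))  # True
```

`Decimal` is used rather than `float` so that comparisons stay exact for long or fractional answers.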
Proposed PostTrainBench mitigation
Ship a local scorer that:
- Parses ANSWER: when present, else uses the last numeric token
- Grades with normalized equality
- Replaces inspect_evals/aime2025 in evaluate.py with a local inspect task
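The extraction and grading steps listed above could look roughly like this. It is a sketch of the grading logic only, assuming AIME answers are non-negative integers; `extract_answer` and `grade` are hypothetical names, and wiring the function into an inspect task is not shown.

```python
import re

# Assumption: answers are non-negative integers, so only digit runs are matched.
ANSWER_RE = re.compile(r"ANSWER:\s*(\d+)")
NUMBER_RE = re.compile(r"\d+")


def extract_answer(completion: str):
    """Prefer the last 'ANSWER: <int>' marker; otherwise fall back to the last integer token."""
    marked = ANSWER_RE.findall(completion)
    if marked:
        return marked[-1]
    tokens = NUMBER_RE.findall(completion)
    return tokens[-1] if tokens else None


def grade(completion: str, gold: str) -> bool:
    """Correct only when the extracted integer equals the gold integer."""
    answer = extract_answer(completion)
    return answer is not None and int(answer) == int(gold)


print(grade("ANSWER: 711", "11"))  # False
print(grade("ANSWER: 49", "49"))   # True
print(grade("ANSWER: 149 (checked 3 cases)", "149"))  # True
```

Preferring the `ANSWER:` marker keeps trailing commentary that contains digits from overriding the stated answer, as in the last call above.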
PR to follow on branch fix/aime2025-numeric-equality-scorer.