AIME 2025 scoring: inspect_ai numeric match uses endswith (false positives e.g. 711 vs 11) #44

@wise-east

Summary

AIME 2025 evaluation uses inspect_evals.aime2025, whose scorer wraps inspect_ai.scorer.match(numeric=True) with default location="end". After extracting the last numeric token from the model completion, inspect_ai compares it to the gold label with str.endswith, not numeric equality.

That produces false positives when a wrong answer shares a digit suffix with the correct label.

Concrete example

| Model output | Gold | match(numeric=True) | Should be |
| --- | --- | --- | --- |
| ANSWER: 711 | 11 | Correct | Incorrect |
| ANSWER: 149 | 49 | Correct | Incorrect |

On the math-ai/aime25 test set (30 problems), labels 49 and 149 are both present; a prediction of 149 for a problem with answer 49 is scored correct.
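To gauge how exposed a label set is, one option is to scan it for pairs where one label is a proper digit suffix of another. The helper below is a hypothetical sketch (the name `suffix_collisions` is not part of inspect_ai or PostTrainBench), and the sample labels are illustrative values taken from this issue, not the full dataset.

```python
def suffix_collisions(labels: list[str]) -> list[tuple[str, str]]:
    """Return (prediction, gold) pairs where gold is a proper suffix of prediction.

    Each pair is a case where predicting `prediction` on a problem whose answer
    is `gold` would pass a suffix-based check.
    """
    unique = sorted(set(labels), key=lambda s: (len(s), s))
    return [(p, g) for g in unique for p in unique if p != g and p.endswith(g)]


# Illustrative values only: 49 and 149 are named in the issue text,
# 11 and 711 come from its example table.
print(suffix_collisions(["49", "149", "11", "711"]))
```

Running the same check over the real gold labels would list every pair for which a prediction equal to another problem's answer could be scored correct.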

Reproduction (inspect-ai 0.3.x):

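# match_str is a private helper, so its import path may change between releases.
from inspect_ai.scorer._common import match_str

_, ok = match_str("ANSWER: 711", "11", location="end", numeric=True)
assert ok  # passes even though 711 != 11, which is the bug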

Root cause

In inspect_ai/scorer/_common.py, the numeric branch extracts the last number, then falls through to:

elif location == "end":
    return answer, v.endswith(t)

location="exact" is not a viable workaround (it compares the entire chain-of-thought string to the target).
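The difference between the two comparisons can be shown without inspect_ai at all. The sketch below assumes the numeric token has already been extracted and only contrasts a suffix test with numeric equality; the function names are hypothetical and `Decimal` is one possible normalization, not necessarily what an upstream fix would use.

```python
from decimal import Decimal


def suffix_compare(value: str, target: str) -> bool:
    """Suffix test, as in the `location == "end"` branch quoted above."""
    return value.endswith(target)


def numeric_compare(value: str, target: str) -> bool:
    """Numeric equality on the parsed values."""
    return Decimal(value) == Decimal(target)


for value, target in [("711", "11"), ("149", "49"), ("11", "11")]:
    print(value, target, suffix_compare(value, target), numeric_compare(value, target))
```

Only the last pair should be graded correct, and only `numeric_compare` rejects the first two.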

Impact

  • Inflated AIME accuracy on PostTrainBench runs
  • Run-to-run score noise when combined with sampling (separate harness concern)
  • Same pattern affects other inspect evals using match(numeric=True) (e.g. GSM8K)

Upstream

The primary fix belongs in inspect_ai (use == on normalized numbers after extraction). See UKGovernmentBEIS/inspect_ai.

Proposed PostTrainBench mitigation

Ship a local scorer and wire it into the harness:

  1. Parse ANSWER: when present; otherwise use the last numeric token
  2. Grade with normalized numeric equality
  3. Replace inspect_evals/aime2025 in evaluate.py with a local inspect task that uses this scorer
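The extraction and grading steps above could look roughly like the following. This is a stdlib-only sketch under stated assumptions: the regexes, the function names (`extract_answer`, `is_correct`), and the choice of `Decimal` for normalization are all hypothetical, and the wiring into an inspect task in evaluate.py is not shown.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical patterns; AIME answers are integers, so no thousands separators are handled.
ANSWER_LINE = re.compile(r"ANSWER:\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(completion: str) -> Optional[str]:
    """Prefer the last 'ANSWER: <number>' occurrence, else the last numeric token."""
    tagged = ANSWER_LINE.findall(completion)
    if tagged:
        return tagged[-1]
    tokens = NUMBER.findall(completion)
    return tokens[-1] if tokens else None


def normalize(token: str) -> Optional[Decimal]:
    """Parse a numeric string; return None if it is not a valid number."""
    try:
        return Decimal(token)
    except InvalidOperation:
        return None


def is_correct(completion: str, target: str) -> bool:
    """Grade by equality of the normalized numbers, never by suffix."""
    predicted = extract_answer(completion)
    if predicted is None:
        return False
    p, t = normalize(predicted), normalize(target.strip())
    return p is not None and t is not None and p == t
```

With this sketch, `is_correct("ANSWER: 711", "11")` and `is_correct("ANSWER: 149", "49")` are both false, while `is_correct("ANSWER: 011", "11")` is true because leading zeros are normalized away.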

PR to follow on branch fix/aime2025-numeric-equality-scorer.
