AIME 2025 scoring: inspect_ai numeric match uses endswith (false positives e.g. 711 vs 11) #44

## Summary

The AIME 2025 evaluation uses `inspect_evals.aime2025`, whose scorer wraps `inspect_ai.scorer.match(numeric=True)` with the default `location="end"`. After extracting the **last numeric token** from the model completion, inspect_ai compares it to the gold label with **`str.endswith`** rather than numeric equality.

This produces a false positive whenever the gold label is a proper suffix of the digits in a wrong answer: the prediction `711` ends with `11`, so it is graded correct against the label `11`.

## Concrete example

| Model output | Gold | `match(numeric=True)` | Should be |
|--------------|------|------------------------|-----------|
| `ANSWER: 711` | `11` | **Correct** | Incorrect |
| `ANSWER: 149` | `49` | **Correct** | Incorrect |

On the `math-ai/aime25` test set (30 problems), labels `49` and `149` are both present; a prediction of `149` for a problem with answer `49` is scored correct.
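To check how exposed a label set is, one option is to list every pair of strings where one is a proper suffix of the other. The sketch below is only an illustration: `suffix_collisions` is a hypothetical helper, and the input list holds the values quoted in this issue, not the full `math-ai/aime25` label set.

```python
from itertools import permutations


def suffix_collisions(labels):
    """Return (longer, shorter) pairs where `shorter` is a proper suffix of `longer`."""
    # Deduplicate and sort by length so the output order is deterministic.
    unique = sorted(set(labels), key=lambda s: (len(s), s))
    return [
        (longer, shorter)
        for shorter, longer in permutations(unique, 2)
        if len(longer) > len(shorter) and longer.endswith(shorter)
    ]


# Illustrative values taken from the examples in this issue (hypothetical input).
values = ["11", "49", "149", "711"]
print(suffix_collisions(values))  # [('711', '11'), ('149', '49')]
```

Any pair reported this way is a case where predicting the longer value for the shorter label would pass a suffix comparison.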

Reproduction (inspect-ai 0.3.x):

```python
from inspect_ai.scorer._common import match_str
_, ok = match_str("ANSWER: 711", "11", location="end", numeric=True)
assert ok  # passes, which is the bug: 711 is accepted for gold label 11
```

## Root cause

In `inspect_ai/scorer/_common.py`, the numeric branch extracts the last number, then falls through to:

```python
elif location == "end":
    return answer, v.endswith(t)
```

`location="exact"` is not a viable workaround (it compares the entire chain-of-thought string to the target).
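The difference between the two comparisons can be seen without inspect_ai at all. The sketch below assumes the numeric token has already been extracted; `suffix_match` and `numeric_equal` are hypothetical names used only for illustration, and the integer parse assumes AIME-style integer answers.

```python
def suffix_match(extracted: str, target: str) -> bool:
    # String suffix test, the same kind of comparison as the `endswith` line above.
    return extracted.endswith(target)


def numeric_equal(extracted: str, target: str) -> bool:
    # Compare parsed integer values instead of string suffixes.
    return int(extracted) == int(target)


for extracted, target in [("711", "11"), ("149", "49"), ("11", "11"), ("011", "11")]:
    print(extracted, target, suffix_match(extracted, target), numeric_equal(extracted, target))
```

The first two pairs pass the suffix test but fail numeric equality, while `11` and `011` pass both.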

## Impact

- Inflated AIME accuracy on PostTrainBench runs
- Run-to-run score noise when combined with sampling (separate harness concern)
- Same pattern affects other inspect evals using `match(numeric=True)` (e.g. GSM8K)

## Upstream

The primary fix belongs in **inspect_ai** (use `==` on normalized numbers after extraction). See UKGovernmentBEIS/inspect_ai.

## Proposed PostTrainBench mitigation

Ship a local scorer and wire it into the harness:

1. Parse `ANSWER:` when present; otherwise use the last numeric token
2. Grade with normalized **equality**
3. Replace `inspect_evals/aime2025` in `evaluate.py` with a local inspect task that uses this scorer
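A minimal sketch of the extraction and grading logic described in the first two steps is shown here in plain Python. It is not the implementation from the forthcoming PR: `extract_answer`, `normalize`, and `grade` are hypothetical names, the regular expressions assume integer answers (optionally with thousands separators), and the wiring into an inspect task is left out.

```python
import re
from typing import Optional

# Assumption: answers are integers, optionally with commas as thousands separators.
ANSWER_RE = re.compile(r"ANSWER:\s*\$?\s*(-?\d[\d,]*)", re.IGNORECASE)
NUMBER_RE = re.compile(r"-?\d[\d,]*")


def extract_answer(completion: str) -> Optional[str]:
    """Prefer the last 'ANSWER: <int>' occurrence, else the last integer token."""
    tagged = ANSWER_RE.findall(completion)
    if tagged:
        return tagged[-1]
    numbers = NUMBER_RE.findall(completion)
    return numbers[-1] if numbers else None


def normalize(token: str) -> int:
    """Drop commas and parse, so '011' and '11' compare equal."""
    return int(token.replace(",", ""))


def grade(completion: str, target: str) -> bool:
    """Return True only when the extracted answer equals the target numerically."""
    token = extract_answer(completion)
    if token is None:
        return False
    return normalize(token) == normalize(target)


print(grade("ANSWER: 711", "11"))  # False
print(grade("ANSWER: 11", "11"))   # True
```

Grading on parsed integers rather than strings means a prediction such as `149` no longer matches the label `49`, while harmless formatting differences like a leading zero still pass.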

PR to follow on branch `fix/aime2025-numeric-equality-scorer`.
