Commit 3e85d67

feat(accuracy): AIME 2025 lighteval-aligned benchmark loader (AIP-876)
Implement ``AIME25Benchmark``, mirroring the trt-llm benchmark recipe's ``acc_bench_lighteval.py:aime25`` configuration: same ``aime_prompt_fn`` zero-shot rendering, ``generation_size=32768``, ``hf_repo="yentinglin/aime_2025"``. It has the same shape as ``AIME24Benchmark``, just pointed at the 2025 mirror.

The loader emits one ``BenchmarkProblem`` per dataset row with the bare problem text as ``prompt``, ``str(answer)`` as ``ground_truth``, and ``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` / ``enable_cot`` are accepted for protocol uniformity but ignored. Pair with ``LightevalExprGrader`` for the recipe's ``expr_gold_metric`` extraction.

Built on top of AIP-875 (lighteval sub-stack ordering: 875 → 876). There is no heavy optional dependency (``datasets`` is core), so CI gets 100% line + branch coverage out of the box.

Updates the stub registry: drop ``aime25`` from ``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented: false`` from the ``aime25`` plugins.yaml entry, switch ``default_grader`` to ``lighteval_expr``, add the ``aime25`` row to ``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the Status Summary, Method Count Summary, and Suggested Implementation Order accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
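
To make the row-to-problem mapping concrete, here is a minimal sketch under stated assumptions. `ProblemSketch` is a hypothetical stand-in for `BenchmarkProblem` (the real model lives in `aiperf.accuracy.models` and is not reproduced here), `build` is an illustrative helper, and the sample row is made up. The `problem` / `answer` column names are the ones used in the loader diff.

```python
from dataclasses import dataclass, field
from typing import Any

GENERATION_SIZE = 32768  # value stated in the commit message


@dataclass
class ProblemSketch:
    """Hypothetical stand-in for BenchmarkProblem (illustration only)."""

    prompt: str
    ground_truth: str
    metadata: dict[str, Any] = field(default_factory=dict)


def row_to_problem(row: dict[str, Any]) -> ProblemSketch:
    # Bare problem text as the prompt; the answer is stringified for grading.
    return ProblemSketch(
        prompt=row["problem"],
        ground_truth=str(row["answer"]),
        metadata={"generation_size": GENERATION_SIZE},
    )


def build(rows, tasks=None, n_shots=0, enable_cot=False):
    # tasks / n_shots / enable_cot are accepted for a uniform signature
    # but deliberately unused, as the commit message describes.
    return [row_to_problem(r) for r in rows]


sample = {"problem": "What is 1+1?", "answer": 2}  # made-up row
problem = row_to_problem(sample)
```

Because the extra arguments are ignored, `build([sample], n_shots=5)` and `build([sample])` produce equal results in this sketch.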
1 parent a90d154 commit 3e85d67

6 files changed

Lines changed: 283 additions & 22 deletions

docs/accuracy/accuracy-benchmarking.md

Lines changed: 1 addition & 0 deletions
@@ -75,6 +75,7 @@ system message).
 | `hellaswag` | `exact_match` | 10 | `Rowan/hellaswag` (trt-llm/DeepEval reference; one few-shot per unique activity_label) |
 | `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |
 | `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
+| `aime25` | `lighteval_expr` | 0 | `yentinglin/aime_2025` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
 
 ## CLI Flags

docs/accuracy/accuracy_stubs.md

Lines changed: 9 additions & 9 deletions
@@ -7,7 +7,7 @@
 
 This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.
 
-**Status summary:** With the AIME24 loader landing on top of the BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, and `AIME24Benchmark` are fully implemented; the remaining benchmarks (`aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
+**Status summary:** With the AIME25 loader landing on top of the AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, and `AIME25Benchmark` are fully implemented; the remaining benchmarks (`math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
 
 ## Table of Contents

@@ -174,15 +174,15 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
 | 3 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `exact_match` | 10 | **IMPLEMENTED.** Loads `Rowan/hellaswag` (validation split filtered per task by `activity_label`; train split feeds the "one few-shot per unique activity_label" rule). Prompt rendering delegates to `deepeval.benchmarks.HellaSwag`'s `HellaSwagTemplate.generate_output`, so output is byte-equal to the trt-llm recipe's DeepEval-backed path. Pairs with `exact_match` for strict `Scorer.exact_match_score` semantics. Requires the `[accuracy]` extras (deepeval). |
 | 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |
 | 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. |
+| 6 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `lighteval_expr` | 0 | **IMPLEMENTED.** Same lighteval-aligned shape as `AIME24Benchmark` but pointed at `yentinglin/aime_2025` (the recipe's `aime25` task config). Identical prompt rendering, generation size, and grader pairing. |
 
 ### Still Stubbed
 
 | # | Class | File | Plugin Key | Default Grader | Default N-Shots |
 |---|-------|------|------------|----------------|-----------------|
-| 1 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
-| 2 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
-| 3 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 4 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 1 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
+| 2 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
+| 3 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
 
 **Each benchmark has 1 method to implement:**
 
@@ -309,23 +309,23 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
 | Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
 |-----------|-------------|---------------|------------------|-------------------|
 | Graders | 7 (all) | 0 || 0 |
-| Benchmarks | 5 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) | 4 | 1 (`load_problems`) | 4 |
+| Benchmarks | 6 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) | 3 | 1 (`load_problems`) | 3 |
 | Record Processor | 1 (`AccuracyRecordProcessor`) | 0 || 0 |
 | Results Processor | 1 (`AccuracyResultsProcessor`) | 0 || 0 |
 | Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 || 0 |
 | Data Exporter | 1 (`AccuracyDataExporter`) | 0 || 0 |
 | Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
-| **Total** | **16** | **5** | | **5** |
+| **Total** | **17** | **4** | | **4** |
 
 ### Self-Disabling Pattern
 
 Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`.
 
 ### Suggested Implementation Order
 
-The processors, exporters, all graders, and five benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) are already wired end-to-end. The remaining work is the four stub benchmarks; mirror the existing loader whose grader matches:
+The processors, exporters, all graders, and six benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) are already wired end-to-end. The remaining work is the three stub benchmarks; mirror the existing loader whose grader matches:
 
-1. **`aime25`, `math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_expr` (aime25) or `lighteval_latex` (math_500).
+1. **`math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_latex`.
 2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
 3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
 4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
Lines changed: 62 additions & 9 deletions
@@ -1,31 +1,84 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 
+"""AIME 2025 benchmark loader, aligned with the trt-llm lighteval reference.
+
+Mirrors ``acc_bench_lighteval.py:aime25``: same ``aime_prompt_fn``,
+same zero-shot config, ``generation_size=32768``,
+``hf_repo="yentinglin/aime_2025"``. See the AIME24 module for a fuller
+explanation of the design.
+
+Reference:
+    trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:142
+"""
+
 from __future__ import annotations
 
-from typing import TYPE_CHECKING
+import asyncio
+from typing import TYPE_CHECKING, Any
 
-from aiperf.accuracy.models import BenchmarkProblem
+from datasets import Dataset, load_dataset
+
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
 from aiperf.common.mixins import AIPerfLoggerMixin
 
 if TYPE_CHECKING:
     from aiperf.config.resolution.plan import BenchmarkRun
 
+DATASET_NAME = "yentinglin/aime_2025"
+TASK_NAME = "aime25"
+
+# lighteval's aime25 task config: ``generation_size=32768``.
+DEFAULT_GENERATION_SIZE = 32768
+
+# Schema field names in yentinglin/aime_2025 (same lowercase shape as
+# AIME24's HuggingFaceH4 mirror).
+PROBLEM_FIELD = "problem"
+ANSWER_FIELD = "answer"
+
 
 class AIME25Benchmark(AIPerfLoggerMixin):
-    """Registered placeholder for a future AIME 2025 loader.
+    """AIME 2025 lighteval-aligned benchmark loader.
 
-    `load_problems()` intentionally raises NotImplementedError in this release;
-    use the MMLU benchmark when a working accuracy loader is required.
+    Loads ``yentinglin/aime_2025`` (train split) and emits one user
+    message per problem containing the bare problem text — matching
+    lighteval's zero-shot ``aime_prompt_fn`` rendering. Pair with
+    ``LightevalExprGrader`` for grading parity with the recipe.
     """
 
-    def __init__(self, run: BenchmarkRun, **kwargs) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self.run = run
 
     async def load_problems(
         self, tasks: list[str] | None, n_shots: int, enable_cot: bool
     ) -> list[BenchmarkProblem]:
-        raise NotImplementedError(
-            "aime25 benchmark is not yet implemented; only 'mmlu' is available in this release."
-        )
+        """Load AIME25 problems and format them lighteval-style.
+
+        Args:
+            tasks: Ignored — AIME25 has no subtasks.
+            n_shots: Ignored — the lighteval reference is zero-shot.
+            enable_cot: Ignored — lighteval's ``aime_prompt_fn`` does
+                not add a CoT trigger.
+
+        Returns:
+            One ``BenchmarkProblem`` per dataset row, in dataset order.
+        """
+        ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="train")
+        return await asyncio.to_thread(self._build_problems, ds)
+
+    def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
+        problems: list[BenchmarkProblem] = []
+        for row in ds:
+            problem = row[PROBLEM_FIELD]
+            messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}]
+            problems.append(
+                BenchmarkProblem(
+                    prompt=problem,
+                    ground_truth=str(row[ANSWER_FIELD]),
+                    task=TASK_NAME,
+                    metadata={"generation_size": DEFAULT_GENERATION_SIZE},
+                    raw_messages=messages,
+                )
+            )
+        return problems
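
`load_problems` hands both the blocking `load_dataset` call and the row loop to worker threads through `asyncio.to_thread`, which keeps the event loop free while the dataset is fetched. The following is a minimal sketch of that pattern only: `slow_load` is a hypothetical blocking stand-in for `load_dataset` (no network access, no `datasets` dependency), and `load` and `build` are illustrative names.

```python
import asyncio
import threading
import time


def slow_load(name: str, split: str) -> list[dict]:
    """Hypothetical blocking loader standing in for datasets.load_dataset."""
    time.sleep(0.05)  # simulate blocking I/O
    return [{"problem": f"{name}/{split} q{i}", "answer": i} for i in range(3)]


async def load(name: str) -> tuple[list[str], bool]:
    loop_thread = threading.get_ident()
    ran_off_loop = False

    def build(rows: list[dict]) -> list[str]:
        # Record whether this function ran on a thread other than the loop's.
        nonlocal ran_off_loop
        ran_off_loop = threading.get_ident() != loop_thread
        return [str(r["answer"]) for r in rows]

    # Keyword arguments are forwarded to the callable, as in the loader.
    rows = await asyncio.to_thread(slow_load, name, split="train")
    answers = await asyncio.to_thread(build, rows)
    return answers, ran_off_loop


answers, ran_off_loop = asyncio.run(load("demo"))
```

Running the row loop off the loop thread matters mostly for large datasets; for a small one the second `to_thread` call is cheap insurance rather than a necessity.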

src/aiperf/plugin/plugins.yaml

Lines changed: 4 additions & 3 deletions
@@ -1265,11 +1265,12 @@ accuracy_benchmark:
   aime25:
     class: aiperf.accuracy.benchmarks.aime25:AIME25Benchmark
     description: |
-      AIME 2025 benchmark with problems from the 2025 competition year.
+      AIME 2025 benchmark, aligned with the trt-llm benchmark recipe's
+      lighteval-backed configuration (yentinglin/aime_2025 + lighteval
+      ``expr_gold_metric``).
     metadata:
-      default_grader: math
+      default_grader: lighteval_expr
       default_n_shots: 0
-      is_implemented: false
 
   math_500:
     class: aiperf.accuracy.benchmarks.math_500:Math500Benchmark

tests/unit/accuracy/test_accuracy_config.py

Lines changed: 0 additions & 1 deletion
@@ -23,7 +23,6 @@
 # This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
 # so those names are absent from the stub lists.
 STUB_BENCHMARKS = (
-    "aime25",
     "math_500",
     "gpqa_diamond",
     "lcb_codegeneration",
Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Unit tests for ``AIME25Benchmark`` after lighteval alignment.
+
+Same shape as ``test_aime24_benchmark.py`` — the lighteval reference
+config is identical except for the dataset URL.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from aiperf.accuracy.benchmarks.aime25 import (
+    DEFAULT_GENERATION_SIZE,
+    TASK_NAME,
+    AIME25Benchmark,
+)
+from aiperf.accuracy.models import BenchmarkProblem
+from aiperf.plugin.enums import AccuracyBenchmarkType, EndpointType
+from tests.unit.conftest import make_benchmark_run
+
+
+def _make_run():
+    return make_benchmark_run(
+        model_names=["test-model"],
+        endpoint_type=EndpointType.COMPLETIONS,
+        streaming=False,
+        accuracy={"benchmark": AccuracyBenchmarkType.AIME25},
+    )
+
+
+def _make_row(problem: str = "What is 1+1?", answer: int = 2) -> dict[str, Any]:
+    return {"problem": problem, "answer": answer}
+
+
+def _make_fake_dataset(rows: list[dict[str, Any]]) -> MagicMock:
+    ds = MagicMock()
+    ds.__iter__ = MagicMock(side_effect=lambda: iter(rows))
+    ds.__len__ = MagicMock(return_value=len(rows))
+    ds.__getitem__ = MagicMock(side_effect=lambda i: rows[i])
+    return ds
+
+
+class TestPromptIsBareProblemText:
+    @pytest.mark.asyncio
+    async def test_flat_prompt_is_problem_text(self) -> None:
+        rows = [_make_row("Compute the answer.", 42)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].prompt == "Compute the answer."
+
+    @pytest.mark.asyncio
+    async def test_no_instruction_prefix(self) -> None:
+        rows = [_make_row("Q?", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        prompt = problems[0].prompt
+        assert "**Problem**" not in prompt
+        assert "competition math" not in prompt
+        assert "Let's think" not in prompt
+        assert "boxed" not in prompt
+
+    @pytest.mark.asyncio
+    async def test_chat_message_is_single_user_message(self) -> None:
+        rows = [_make_row("Q?", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        msgs = problems[0].raw_messages
+        assert msgs is not None
+        assert len(msgs) == 1
+        assert msgs[0]["role"] == "user"
+        assert msgs[0]["content"] == "Q?"
+
+
+class TestNShotsAndCoTAreIgnored:
+    @pytest.mark.asyncio
+    async def test_n_shots_argument_does_not_affect_prompt(self) -> None:
+        rows = [_make_row(f"q{i}", i) for i in range(3)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            zero_shot = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+            five_shot = await bench.load_problems(
+                tasks=None, n_shots=5, enable_cot=False
+            )
+        assert [p.prompt for p in zero_shot] == [p.prompt for p in five_shot]
+
+    @pytest.mark.asyncio
+    async def test_enable_cot_does_not_affect_prompt(self) -> None:
+        rows = [_make_row("Q?", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            no_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=False)
+            with_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=True)
+        assert no_cot[0].prompt == with_cot[0].prompt
+
+
+class TestLoadProblemsCore:
+    @pytest.mark.asyncio
+    async def test_returns_one_problem_per_row(self) -> None:
+        rows = [_make_row(f"q{i}", i) for i in range(5)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert len(problems) == 5
+        assert all(isinstance(p, BenchmarkProblem) for p in problems)
+
+    @pytest.mark.asyncio
+    async def test_ground_truth_is_string_form_of_answer(self) -> None:
+        rows = [_make_row("q", 42)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].ground_truth == "42"
+
+    @pytest.mark.asyncio
+    async def test_task_name_is_aime25(self) -> None:
+        rows = [_make_row("q", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].task == TASK_NAME
+
+    @pytest.mark.asyncio
+    async def test_generation_size_is_32k(self) -> None:
+        rows = [_make_row("q", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].metadata["generation_size"] == DEFAULT_GENERATION_SIZE
+        assert DEFAULT_GENERATION_SIZE == 32768
+
+
+class TestPathologicalDatasetRows:
+    @pytest.mark.asyncio
+    async def test_empty_dataset_returns_empty_list(self) -> None:
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset([]),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems == []
+
+    @pytest.mark.asyncio
+    async def test_unicode_problem_text_preserved(self) -> None:
+        rows = [_make_row("Solve ∑₁ⁿ k² for n=10. ✓", 385)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime25.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME25Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert "∑₁ⁿ" in problems[0].prompt
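
One detail of `_make_fake_dataset` is worth spelling out: it configures `__iter__` with `side_effect=lambda: iter(rows)` rather than `return_value=iter(rows)`. Tests such as `test_n_shots_argument_does_not_affect_prompt` call `load_problems` twice against the same fake, so the fake has to survive more than one pass. The sketch below, using made-up rows and only `unittest.mock`, contrasts the two configurations.

```python
from unittest.mock import MagicMock

rows = [{"problem": "q0", "answer": 0}, {"problem": "q1", "answer": 1}]  # made-up rows

# return_value: every iter() call hands back the SAME iterator object,
# so it is exhausted after the first pass.
one_shot = MagicMock()
one_shot.__iter__ = MagicMock(return_value=iter(rows))
first_pass = list(one_shot)
second_pass = list(one_shot)

# side_effect: the lambda runs on every iter() call and builds a fresh iterator.
reusable = MagicMock()
reusable.__iter__ = MagicMock(side_effect=lambda: iter(rows))
pass_a = list(reusable)
pass_b = list(reusable)
```

With the `return_value` form, a second `load_problems` call in the same test would silently see an empty dataset, which is the failure mode the `side_effect` form avoids.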
