Commit 69a37e4

feat(accuracy): implement AIME 2024 benchmark loader (AIP-875)

Stacks on AIP-874 (AIMEBenchmark + MathGrader). Implements AIME24Benchmark
for the lighteval-canonical ``HuggingFaceH4/aime_2024`` mirror (lowercase
``problem``/``answer`` schema, distinct from the Maxwell-Jia mirror used by
the year-agnostic ``aime`` loader). Reuses ``INSTRUCTION_PREFIX`` and
``DEFAULT_GENERATION_SIZE`` from ``aime.py`` so the prompt format and
generation cap stay in lockstep across the AIME family.

Loader (src/aiperf/accuracy/benchmarks/aime24.py):
- Pulls problems from HuggingFaceH4/aime_2024, train split.
- Same chat-message structure as AIMEBenchmark: instruction on the first
  user message, ``\boxed{answer}`` assistant primers, sequential few-shot
  draws.
- Pairs with MathGrader (default grader) for numerical equivalence.
- All field names (PROBLEM_FIELD, ANSWER_FIELD, DATASET_NAME, TASK_NAME)
  are named constants; no magic literals.

Tests (29 new, tests/unit/accuracy/test_aime24_benchmark.py):
- Format-prompt: instruction prefix, problem text, CoT step-by-step,
  few-shot \\boxed{} priming, gold answer not leaked into test query.
- Chat-message construction: zero-shot single-message, instruction on
  first user only, user/assistant/user pairing, assistant boxed format.
- Few-shot sampling: zero/negative shots, clamping to dataset size,
  sequential draw order.
- Load-problems end-to-end with a mocked HuggingFace Dataset: per-row
  output shape, ground-truth stringification, task name, raw_messages
  populated, default generation size in metadata, tasks argument ignored,
  distinct task name from ``aime``.
- Pathological dataset rows: empty splits, unicode problem text, very long
  problems, n_shots larger than dataset size.

Documentation:
- docs/accuracy/accuracy-benchmarking.md availability table now lists
  ``aime24`` next to ``aime`` with their distinct dataset sources.
- docs/accuracy/accuracy_stubs.md status table moves AIME24Benchmark from
  "Still Stubbed" to "Implemented".

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

1 parent 8b16196 commit 69a37e4
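
The few-shot sampling rules named in the commit message (zero or negative shots yield no examples, oversized requests are clamped to the dataset size, draws are sequential from the start) can be illustrated without downloading anything. This is a minimal sketch, assuming a plain list of dicts stands in for the HuggingFace `Dataset`; `draw_few_shots` and the sample rows are hypothetical names and data for illustration, not the method or dataset in the commit.

```python
from __future__ import annotations

from typing import Any

# Field names mirror the lowercase schema described in the commit message.
PROBLEM_FIELD = "problem"
ANSWER_FIELD = "answer"


def draw_few_shots(rows: list[dict[str, Any]], n_shots: int) -> list[dict[str, str]]:
    """Take the first n_shots rows, clamped to the number of rows available."""
    if n_shots <= 0:
        return []
    size = min(n_shots, len(rows))
    return [
        {"problem": rows[i][PROBLEM_FIELD], "answer": str(rows[i][ANSWER_FIELD])}
        for i in range(size)
    ]


# Made-up stand-in rows; the real loader reads HuggingFaceH4/aime_2024.
rows = [
    {"problem": "What is 1 + 1?", "answer": 2},
    {"problem": "What is 2 + 3?", "answer": 5},
]

print(draw_few_shots(rows, 0))         # zero shots: empty list
print(draw_few_shots(rows, -3))        # negative shots: empty list
print(len(draw_few_shots(rows, 10)))   # clamped to the 2 available rows
print(draw_few_shots(rows, 1)[0])      # sequential: always the first row
```

Because the examples come from the same split as the test queries, the first rows can show up as few-shot examples in their own prompts, which is the trade-off the loader's docstring acknowledges.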

4 files changed

Lines changed: 527 additions & 11 deletions

docs/accuracy/accuracy-benchmarking.md

Lines changed: 1 addition & 0 deletions
@@ -72,6 +72,7 @@ system message).
 |---|---|---|---|
 | `mmlu` | `multiple_choice` | 5 | `lighteval/mmlu` (57 subjects) |
 | `aime` | `math` | 8 | `Maxwell-Jia/AIME_2024` (trt-llm reference, 8-shot CoT) |
+| `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference) |
 
 ## CLI Flags
 

docs/accuracy/accuracy_stubs.md

Lines changed: 5 additions & 5 deletions
@@ -170,18 +170,18 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
 |---|-------|------|------------|----------------|-----------------|-------|
 | 1 | `MMLUBenchmark` | `benchmarks/mmlu.py` | `mmlu` | `multiple_choice` | 5 | **IMPLEMENTED in PR #815** — canonical reference for new benchmarks. Downloads via HuggingFace datasets, handles few-shot formatting and CoT. |
 | 2 | `AIMEBenchmark` | `benchmarks/aime.py` | `aime` | `math` | 0 | **IMPLEMENTED.** Loads `Maxwell-Jia/AIME_2024`, instructs the model to wrap its final integer in `\boxed{}`, supports few-shot priming and chain-of-thought. |
+| 3 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (lighteval canonical, lowercase schema). Same prompt format as `aime`. |
 
 ### Still Stubbed
 
 | # | Class | File | Plugin Key | Default Grader | Default N-Shots |
 |---|-------|------|------------|----------------|-----------------|
 | 1 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `multiple_choice` | 0 |
 | 2 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 |
-| 3 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
-| 4 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
-| 5 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
-| 6 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 7 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 3 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
+| 4 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
+| 5 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
+| 6 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
 
 **Each benchmark has 1 method to implement:**
 

src/aiperf/accuracy/benchmarks/aime24.py

Lines changed: 156 additions & 6 deletions
@@ -1,21 +1,171 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 
-from aiperf.accuracy.models import BenchmarkProblem
+"""AIME 2024 benchmark loader, ported from lighteval's aime24 task config.
+
+Loads the ``HuggingFaceH4/aime_2024`` dataset (LightEval's canonical
+AIME 2024 mirror, with lowercase ``problem``/``answer`` field names) and
+formats each problem the same way as :mod:`aiperf.accuracy.benchmarks.aime`
+so the prompt + chat construction stays consistent across the AIME family.
+The split between ``aime`` and ``aime24`` is deliberate: ``aime`` is the
+year-agnostic identifier (DeepEval/Maxwell-Jia capitalized schema), while
+``aime24`` pins to lighteval's canonical mirror so users running
+side-by-side comparisons against lighteval get matching prompts.
+
+lighteval reference: lighteval/src/lighteval/tasks/extended/aime/main.py
+"""
+
+from __future__ import annotations
+
+import asyncio
+from typing import Any
+
+from datasets import Dataset, load_dataset
+
+from aiperf.accuracy.benchmarks.aime import (
+    DEFAULT_GENERATION_SIZE,
+    INSTRUCTION_PREFIX,
+)
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
 from aiperf.common.config import UserConfig
 from aiperf.common.mixins import AIPerfLoggerMixin
 
+DATASET_NAME = "HuggingFaceH4/aime_2024"
+TASK_NAME = "aime24"
+
+# Field names in the HuggingFaceH4/aime_2024 schema (lowercase, distinct
+# from the Maxwell-Jia mirror used by AIMEBenchmark).
+PROBLEM_FIELD = "problem"
+ANSWER_FIELD = "answer"
+
 
 class AIME24Benchmark(AIPerfLoggerMixin):
-    """AIME 2024 benchmark loader."""
+    """AIME 2024 benchmark loader (lighteval canonical schema).
+
+    Loads competition problems from ``HuggingFaceH4/aime_2024`` (train
+    split) and produces ``BenchmarkProblem`` objects ready for both the
+    completions endpoint (flat ``prompt``) and the chat endpoint
+    (``raw_messages``). Pairs with ``MathGrader`` for numerical
+    equivalence; instruction prefix and generation size are reused from
+    :mod:`aiperf.accuracy.benchmarks.aime` so the prompt format stays in
+    lockstep across the AIME family.
+    """
 
-    def __init__(self, user_config: UserConfig, **kwargs) -> None:
+    def __init__(self, user_config: UserConfig, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self.user_config = user_config
 
     async def load_problems(
         self, tasks: list[str] | None, n_shots: int, enable_cot: bool
     ) -> list[BenchmarkProblem]:
-        raise NotImplementedError(
-            "aime24 benchmark is not yet implemented; only 'mmlu' is available in this release."
-        )
+        """Load every AIME 2024 problem and format it for the LLM.
+
+        Args:
+            tasks: Ignored — AIME 2024 has no subtasks. Accepted for
+                protocol parity with benchmarks that do filter.
+            n_shots: Number of few-shot examples to prepend (drawn from
+                the start of the dataset). 0 disables few-shot prompting.
+            enable_cot: When True, append ``Let's think step by step.`` to
+                each query.
+
+        Returns:
+            One ``BenchmarkProblem`` per dataset row, in dataset order.
+        """
+        ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="train")
+        return await asyncio.to_thread(self._build_problems, ds, n_shots, enable_cot)
+
+    def _build_problems(
+        self, ds: Dataset, n_shots: int, enable_cot: bool
+    ) -> list[BenchmarkProblem]:
+        few_shots = self._build_few_shots(ds, n_shots)
+        problems: list[BenchmarkProblem] = []
+        for row in ds:
+            prompt = self._format_prompt(row, few_shots, enable_cot)
+            raw_messages = self._build_chat_messages(row, few_shots, enable_cot)
+            problems.append(
+                BenchmarkProblem(
+                    prompt=prompt,
+                    ground_truth=str(row[ANSWER_FIELD]),
+                    task=TASK_NAME,
+                    metadata={"generation_size": DEFAULT_GENERATION_SIZE},
+                    raw_messages=raw_messages,
+                )
+            )
+        return problems
+
+    def _build_few_shots(self, ds: Dataset, n_shots: int) -> list[dict[str, str]]:
+        """Few-shot examples drawn sequentially from the start of the split.
+
+        The HuggingFaceH4 mirror has no separate dev/validation split, so
+        early problems can appear in their own prompts; lighteval makes
+        the same trade-off when no held-out pool is available.
+        """
+        if n_shots <= 0:
+            return []
+        size = min(n_shots, len(ds))
+        return [self._format_example(ds[i]) for i in range(size)]
+
+    def _format_example(self, row: dict[str, Any]) -> dict[str, str]:
+        """Format a dataset row as a few-shot example with ``\\boxed{}``."""
+        answer = str(row[ANSWER_FIELD])
+        problem = row[PROBLEM_FIELD]
+        return {
+            "problem": problem,
+            "answer": answer,
+            "formatted": f"Problem: {problem}\nAnswer: \\boxed{{{answer}}}",
+        }
+
+    def _format_prompt(
+        self,
+        row: dict[str, Any],
+        few_shots: list[dict[str, str]],
+        enable_cot: bool,
+    ) -> str:
+        """Build the flat completions prompt: instruction + shots + query."""
+        few_shot_text = "\n\n".join(ex["formatted"] for ex in few_shots)
+        if few_shot_text:
+            few_shot_text += "\n\n"
+
+        problem = row[PROBLEM_FIELD]
+        if enable_cot:
+            query = f"Problem: {problem}\nLet's think step by step.\nAnswer:"
+        else:
+            query = f"Problem: {problem}\nAnswer:"
+
+        return INSTRUCTION_PREFIX + few_shot_text + query
+
+    def _build_chat_messages(
+        self,
+        row: dict[str, Any],
+        few_shots: list[dict[str, str]],
+        enable_cot: bool,
+    ) -> list[AccuracyChatMessage]:
+        """Build multi-turn chat messages following lighteval's PromptManager.
+
+        Identical structure to :class:`aiperf.accuracy.benchmarks.aime.AIMEBenchmark`:
+        instruction lives on the first user message, assistant primers
+        contain ``\\boxed{answer}``, and the trailing user message has no
+        re-instruction unless there were zero few-shots.
+        """
+        messages: list[AccuracyChatMessage] = []
+
+        for ix, ex in enumerate(few_shots):
+            q = f"Problem: {ex['problem']}\nAnswer:"
+            if ix == 0:
+                q = INSTRUCTION_PREFIX + q
+            messages.append({"role": "user", "content": q})
+            messages.append(
+                {"role": "assistant", "content": f"\\boxed{{{ex['answer']}}}"}
+            )
+
+        problem = row[PROBLEM_FIELD]
+        if enable_cot:
+            main_q = f"Problem: {problem}\nLet's think step by step.\nAnswer:"
+        else:
+            main_q = f"Problem: {problem}\nAnswer:"
+
+        if not few_shots:
+            main_q = INSTRUCTION_PREFIX + main_q
+
+        messages.append({"role": "user", "content": main_q})
+        return messages
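
To make the chat layout described in the `_build_chat_messages` docstring concrete, here is a stand-alone sketch of the same user/assistant/user pairing. It is an illustration under stated assumptions, not the committed code: `build_chat_messages` is a hypothetical free function, plain dicts stand in for `AccuracyChatMessage`, and the `INSTRUCTION_PREFIX` value is a placeholder because the real string lives in `aime.py` and is not shown in this diff.

```python
from __future__ import annotations

# Placeholder value: the real INSTRUCTION_PREFIX is defined in aime.py.
INSTRUCTION_PREFIX = "<instruction> "


def build_chat_messages(
    problem: str, few_shots: list[dict[str, str]], enable_cot: bool = False
) -> list[dict[str, str]]:
    """Pair each few-shot example as user/assistant, then append the query."""
    messages: list[dict[str, str]] = []
    for ix, ex in enumerate(few_shots):
        q = f"Problem: {ex['problem']}\nAnswer:"
        if ix == 0:
            # The instruction is attached to the first user message only.
            q = INSTRUCTION_PREFIX + q
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": f"\\boxed{{{ex['answer']}}}"})

    cot = "\nLet's think step by step." if enable_cot else ""
    main_q = f"Problem: {problem}{cot}\nAnswer:"
    if not few_shots:
        # Zero-shot: the query is the first user message, so it carries the instruction.
        main_q = INSTRUCTION_PREFIX + main_q
    messages.append({"role": "user", "content": main_q})
    return messages


# Made-up example data for illustration.
shots = [{"problem": "What is 1 + 1?", "answer": "2"}]
for msg in build_chat_messages("What is 2 + 3?", shots):
    print(msg["role"], repr(msg["content"]))
```

With one few-shot example the result is three messages (user, assistant, user), and only the first carries the instruction; with no examples it collapses to a single user message that includes it.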
