Commit abf5d23

feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) (#925)
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
1 parent f93b2ad commit abf5d23

6 files changed

Lines changed: 325 additions & 24 deletions

docs/accuracy/accuracy-benchmarking.md

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ system message).
 | `aime` | `math` | 8 | `Maxwell-Jia/AIME_2024` (trt-llm reference, 8-shot CoT) |
 | `hellaswag` | `exact_match` | 10 | `Rowan/hellaswag` (trt-llm/DeepEval reference; one few-shot per unique activity_label) |
 | `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |
+| `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |

 ## CLI Flags

docs/accuracy/accuracy_stubs.md

Lines changed: 11 additions & 11 deletions
@@ -7,7 +7,7 @@

 This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.

-**Status summary:** With the BigBench-Hard loader landing on top of the HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, and `BigBenchBenchmark` are fully implemented; the remaining benchmarks (`aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
+**Status summary:** With the AIME24 loader landing on top of the BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, and `AIME24Benchmark` are fully implemented; the remaining benchmarks (`aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.

 ## Table of Contents

@@ -173,16 +173,16 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
 | 2 | `AIMEBenchmark` | `benchmarks/aime.py` | `aime` | `math` | 8 | **IMPLEMENTED.** Loads `Maxwell-Jia/AIME_2024`, instructs the model to wrap its final integer in `\boxed{}`, supports few-shot priming and chain-of-thought. `default_enable_cot=true`. |
 | 3 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `exact_match` | 10 | **IMPLEMENTED.** Loads `Rowan/hellaswag` (validation split filtered per task by `activity_label`; train split feeds the "one few-shot per unique activity_label" rule). Prompt rendering delegates to `deepeval.benchmarks.HellaSwag`'s `HellaSwagTemplate.generate_output`, so output is byte-equal to the trt-llm recipe's DeepEval-backed path. Pairs with `exact_match` for strict `Scorer.exact_match_score` semantics. Requires the `[accuracy]` extras (deepeval). |
 | 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |
+| 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. |

 ### Still Stubbed

 | # | Class | File | Plugin Key | Default Grader | Default N-Shots |
 |---|-------|------|------------|----------------|-----------------|
-| 1 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
-| 2 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
-| 3 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
-| 4 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 5 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 1 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
+| 2 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
+| 3 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
+| 4 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |

 **Each benchmark has 1 method to implement:**

@@ -309,24 +309,24 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
 | Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
 |-----------|-------------|---------------|------------------|-------------------|
 | Graders | 7 (all) | 0 || 0 |
-| Benchmarks | 4 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) | 5 | 1 (`load_problems`) | 5 |
+| Benchmarks | 5 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) | 4 | 1 (`load_problems`) | 4 |
 | Record Processor | 1 (`AccuracyRecordProcessor`) | 0 || 0 |
 | Results Processor | 1 (`AccuracyResultsProcessor`) | 0 || 0 |
 | Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 || 0 |
 | Data Exporter | 1 (`AccuracyDataExporter`) | 0 || 0 |
 | Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
-| **Total** | **15** | **6** | | **6** |
+| **Total** | **16** | **5** | | **5** |

 ### Self-Disabling Pattern

 Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`.

 ### Suggested Implementation Order

-The processors, exporters, all graders, and four benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) are already wired end-to-end. The remaining work is the five stub benchmarks; mirror the existing loader whose grader matches:
+The processors, exporters, all graders, and five benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) are already wired end-to-end. The remaining work is the four stub benchmarks; mirror the existing loader whose grader matches:

-1. **`aime24`, `aime25`, `math_500`** — mirror `AIMEBenchmark` (`benchmarks/aime.py`); pair with the `math` grader.
-2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `multiple_choice` grader.
+1. **`aime25`, `math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_expr` (aime25) or `lighteval_latex` (math_500).
+2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
 3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
 4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
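
Reviewer note on the validator step in the hunk above: the body of `AccuracyConfig._reject_stub_plugins()` is not part of this commit, so the following is only a minimal sketch of what such a check might look like. The function name `reject_stub_benchmark` and the error wording are hypothetical; the stub names are the four benchmarks the updated doc lists as still stubbed.

```python
# Hypothetical sketch; the real AccuracyConfig._reject_stub_plugins() is not
# shown in this diff and may be structured differently.

# Benchmarks the updated doc still lists as stubs after this commit.
STUB_BENCHMARKS = frozenset(
    {"aime25", "math_500", "gpqa_diamond", "lcb_codegeneration"}
)


def reject_stub_benchmark(name: str) -> str:
    """Return the benchmark name unchanged, or raise if it is still a stub."""
    if name in STUB_BENCHMARKS:
        raise ValueError(
            f"benchmark {name!r} is still a stub and cannot be selected yet"
        )
    return name


if __name__ == "__main__":
    # aime24 passes now that it has left the stub list.
    print(reject_stub_benchmark("aime24"))
```

Keeping the stub names in a single collection means that promoting a benchmark, as this commit does for `aime24`, is a one-line removal.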

Lines changed: 86 additions & 9 deletions
@@ -1,31 +1,108 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

+"""AIME 2024 benchmark loader, aligned with the trt-llm lighteval reference.
+
+Mirrors the recipe's ``acc_bench_lighteval.py`` configuration:
+
+    aime24 = LightevalTaskConfig(
+        name="aime24",
+        prompt_function=aime_prompt_fn,
+        hf_repo="HuggingFaceH4/aime_2024",
+        hf_subset="default",
+        evaluation_splits=["train"],
+        few_shots_split=None,
+        few_shots_select=None,
+        generation_size=32768,
+        metric=[expr_gold_metric],
+    )
+
+The recipe's ``aime_prompt_fn`` produces a ``Doc`` whose ``query`` is
+the bare problem text — lighteval's prompt manager wraps it as a
+single user message with no instruction prefix and no few-shot
+priming (``few_shots_split=None``). We emit prompts the same way.
+Pair with ``LightevalExprGrader`` for the recipe's ``expr_gold_metric``
+extraction.
+
+Reference:
+    trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:128
+"""
+
 from __future__ import annotations

-from typing import TYPE_CHECKING
+import asyncio
+from typing import TYPE_CHECKING, Any
+
+from datasets import Dataset, load_dataset

-from aiperf.accuracy.models import BenchmarkProblem
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
 from aiperf.common.mixins import AIPerfLoggerMixin

 if TYPE_CHECKING:
     from aiperf.config.resolution.plan import BenchmarkRun

+DATASET_NAME = "HuggingFaceH4/aime_2024"
+TASK_NAME = "aime24"
+
+# lighteval's aime24 task config: ``generation_size=32768`` to give
+# reasoning models room to think before emitting the boxed answer.
+DEFAULT_GENERATION_SIZE = 32768
+
+# Schema field names in HuggingFaceH4/aime_2024 (lowercase, lighteval
+# canonical — distinct from the Maxwell-Jia mirror used by ``aime``).
+PROBLEM_FIELD = "problem"
+ANSWER_FIELD = "answer"
+

 class AIME24Benchmark(AIPerfLoggerMixin):
-    """Registered placeholder for a future AIME 2024 loader.
+    """AIME 2024 lighteval-aligned benchmark loader.

-    `load_problems()` intentionally raises NotImplementedError in this release;
-    use the MMLU benchmark when a working accuracy loader is required.
+    Loads ``HuggingFaceH4/aime_2024`` (train split) and emits one user
+    message per problem containing the bare problem text — the format
+    lighteval's ``aime_prompt_fn`` + ``PromptManager`` produce when
+    ``few_shots_split=None``. Pair with ``LightevalExprGrader`` for
+    grading parity with the recipe.
     """

-    def __init__(self, run: BenchmarkRun, **kwargs) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self.run = run

     async def load_problems(
         self, tasks: list[str] | None, n_shots: int, enable_cot: bool
     ) -> list[BenchmarkProblem]:
-        raise NotImplementedError(
-            "aime24 benchmark is not yet implemented; only 'mmlu' is available in this release."
-        )
+        """Load AIME24 problems and format them lighteval-style.
+
+        Args:
+            tasks: Ignored — AIME24 has no subtasks.
+            n_shots: Ignored — the lighteval reference is zero-shot
+                (``few_shots_split=None``); accepting the parameter
+                keeps the protocol uniform but emitting few-shots
+                here would diverge from the reference.
+            enable_cot: Ignored — lighteval's ``aime_prompt_fn`` does
+                not add a CoT trigger; the model decides whether to
+                reason based on the system prompt the user provides
+                via ``--accuracy-system-prompt``.
+
+        Returns:
+            One ``BenchmarkProblem`` per dataset row, in dataset
+            order.
+        """
+        ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="train")
+        return await asyncio.to_thread(self._build_problems, ds)
+
+    def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
+        problems: list[BenchmarkProblem] = []
+        for row in ds:
+            problem = row[PROBLEM_FIELD]
+            messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}]
+            problems.append(
+                BenchmarkProblem(
+                    prompt=problem,
+                    ground_truth=str(row[ANSWER_FIELD]),
+                    task=TASK_NAME,
+                    metadata={"generation_size": DEFAULT_GENERATION_SIZE},
+                    raw_messages=messages,
+                )
+            )
+        return problems
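
Reviewer note: the loader's zero-shot shape can be exercised without `datasets` or aiperf installed. The sketch below is an illustration under stated assumptions, not the committed code: `SketchProblem` is a hypothetical stand-in for `BenchmarkProblem`, `SAMPLE_ROWS` is made-up data that only mimics the `problem` / `answer` fields, and the module-level `build_problems` / `load_problems` replace the class methods. It keeps the two behaviors the diff introduces: the bare problem text becomes the single user message, and the blocking build step is pushed off the event loop with `asyncio.to_thread`.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchProblem:
    """Hypothetical stand-in for aiperf's BenchmarkProblem."""

    prompt: str
    ground_truth: str
    task: str
    metadata: dict[str, Any] = field(default_factory=dict)
    raw_messages: list[dict[str, str]] = field(default_factory=list)


# Made-up rows that only mimic the dataset's "problem"/"answer" fields.
SAMPLE_ROWS = [
    {"problem": "What is 2 + 2?", "answer": 4},
    {"problem": "What is 10 * 10?", "answer": 100},
]


def build_problems(rows: list[dict[str, Any]]) -> list[SketchProblem]:
    """Turn rows into zero-shot problems: bare text, one user message."""
    problems: list[SketchProblem] = []
    for row in rows:
        text = row["problem"]
        problems.append(
            SketchProblem(
                prompt=text,
                ground_truth=str(row["answer"]),  # answers are compared as strings
                task="aime24",
                metadata={"generation_size": 32768},
                raw_messages=[{"role": "user", "content": text}],
            )
        )
    return problems


async def load_problems(rows: list[dict[str, Any]]) -> list[SketchProblem]:
    """Run the synchronous build step in a worker thread."""
    return await asyncio.to_thread(build_problems, rows)


if __name__ == "__main__":
    for problem in asyncio.run(load_problems(SAMPLE_ROWS)):
        print(problem.raw_messages, problem.ground_truth)
```

Because `tasks`, `n_shots`, and `enable_cot` are ignored by the committed loader, the sketch simply omits them.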

src/aiperf/plugin/plugins.yaml

Lines changed: 4 additions & 3 deletions
@@ -1255,11 +1255,12 @@ accuracy_benchmark:
   aime24:
     class: aiperf.accuracy.benchmarks.aime24:AIME24Benchmark
     description: |
-      AIME 2024 benchmark with problems from the 2024 competition year.
+      AIME 2024 benchmark, aligned with the trt-llm benchmark recipe's
+      lighteval-backed configuration (HuggingFaceH4/aime_2024 + lighteval
+      ``expr_gold_metric``).
     metadata:
-      default_grader: math
+      default_grader: lighteval_expr
       default_n_shots: 0
-      is_implemented: false

   aime25:
     class: aiperf.accuracy.benchmarks.aime25:AIME25Benchmark
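
Reviewer note: this hunk drops `is_implemented: false` from the `aime24` entry, and the test change further down removes `"aime24"` from `STUB_BENCHMARKS`. One way to keep the two in sync is to derive the stub list from the metadata. The sketch below rests on two assumptions that this diff does not confirm: that a missing `is_implemented` key means the plugin is implemented, and that `aime25` still carries `is_implemented: false`. `PLUGIN_METADATA` and `stub_benchmarks` are hypothetical names, and the dictionary is a hand-written mirror of the YAML rather than a parsed file.

```python
from typing import Any

# Hand-written mirror of two plugins.yaml entries. The aime25 values are
# assumptions for illustration; its metadata is not shown in this diff.
PLUGIN_METADATA: dict[str, dict[str, Any]] = {
    "aime24": {"default_grader": "lighteval_expr", "default_n_shots": 0},
    "aime25": {
        "default_grader": "math",
        "default_n_shots": 0,
        "is_implemented": False,
    },
}


def stub_benchmarks(plugins: dict[str, dict[str, Any]]) -> list[str]:
    """Return the sorted plugin keys explicitly marked as not implemented.

    Assumption: a missing ``is_implemented`` key means "implemented".
    """
    return sorted(
        name
        for name, meta in plugins.items()
        if not meta.get("is_implemented", True)
    )


if __name__ == "__main__":
    print(stub_benchmarks(PLUGIN_METADATA))
```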

tests/unit/accuracy/test_accuracy_config.py

Lines changed: 0 additions & 1 deletion
@@ -23,7 +23,6 @@
 # This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
 # so those names are absent from the stub lists.
 STUB_BENCHMARKS = (
-    "aime24",
     "aime25",
     "math_500",
     "gpqa_diamond",
