feat(accuracy): implement AIME 2024 benchmark loader (AIP-875)

debermudez · debermudez · commit 69a37e436779 · 2026-05-12T12:09:45.000-07:00
Stacks on AIP-874 (AIMEBenchmark + MathGrader).

Implements AIME24Benchmark for the lighteval-canonical
``HuggingFaceH4/aime_2024`` mirror (lowercase ``problem``/``answer``
schema, distinct from the Maxwell-Jia mirror used by the year-agnostic
``aime`` loader). Reuses ``INSTRUCTION_PREFIX`` and
``DEFAULT_GENERATION_SIZE`` from ``aime.py`` so the prompt format and
generation cap stay in lockstep across the AIME family.

Loader (src/aiperf/accuracy/benchmarks/aime24.py):
- Pulls problems from HuggingFaceH4/aime_2024, train split.
- Same chat-message structure as AIMEBenchmark: instruction on the first
  user message, ``\boxed{answer}`` assistant primers, sequential few-shot
  draws.
- Pairs with MathGrader (default grader) for numerical equivalence.
- Field names and dataset/task identifiers (PROBLEM_FIELD, ANSWER_FIELD,
  DATASET_NAME, TASK_NAME) are named constants; no magic literals.
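
A minimal standalone sketch of the chat-message layout described above,
assuming a placeholder instruction string (the real ``INSTRUCTION_PREFIX``
lives in ``aime.py`` and is not reproduced here). ``build_chat_messages``
and the ``(problem, answer)`` tuples are illustrative names, not the
loader's actual API:

```python
# Placeholder only: the real INSTRUCTION_PREFIX is defined in aime.py.
INSTRUCTION_PREFIX = "<instruction> "


def build_chat_messages(problem, few_shots, enable_cot=False):
    """Instruction on the first user turn only; assistant primers are boxed answers."""
    messages = []
    for ix, (shot_problem, shot_answer) in enumerate(few_shots):
        content = f"Problem: {shot_problem}\nAnswer:"
        if ix == 0:
            # Only the very first user message carries the instruction.
            content = INSTRUCTION_PREFIX + content
        messages.append({"role": "user", "content": content})
        messages.append({"role": "assistant", "content": f"\\boxed{{{shot_answer}}}"})

    query = f"Problem: {problem}\n"
    if enable_cot:
        query += "Let's think step by step.\n"
    query += "Answer:"
    if not few_shots:
        # Zero-shot: the test query is the first (and only) user message.
        query = INSTRUCTION_PREFIX + query
    messages.append({"role": "user", "content": query})
    return messages


msgs = build_chat_messages("What is 2+2?", [("What is 1+1?", "2")])
zero_shot = build_chat_messages("What is 2+2?", [])
```

With one few-shot example this yields a user/assistant/user sequence; with
none it yields a single user message that carries the instruction.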

Tests (29 new, tests/unit/accuracy/test_aime24_benchmark.py):
- Format-prompt: instruction prefix, problem text, CoT step-by-step,
  few-shot ``\boxed{}`` priming, gold answer not leaked into test query.
- Chat-message construction: zero-shot single-message, instruction on
  first user only, user/assistant/user pairing, assistant boxed format.
- Few-shot sampling: zero/negative shots, clamping to dataset size,
  sequential draw order.
- Load-problems end-to-end with a mocked HuggingFace Dataset: per-row
  output shape, ground-truth stringification, task name, raw_messages
  populated, default generation size in metadata, tasks argument
  ignored, distinct task name from ``aime``.
- Pathological dataset rows: empty splits, unicode problem text, very
  long problems, n_shots larger than dataset size.
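
The few-shot sampling cases above (zero/negative shots, clamping,
sequential order) can be sketched as follows. This is an illustration
only: a plain list stands in for the mocked HuggingFace Dataset, and
``select_few_shots`` is a hypothetical helper, not a function in the
loader:

```python
def select_few_shots(rows, n_shots):
    """Take the first n_shots rows, clamped to the number of rows available."""
    if n_shots <= 0:
        # Zero or negative shot counts disable few-shot prompting.
        return []
    size = min(n_shots, len(rows))
    return [rows[i] for i in range(size)]


# A plain list of dicts stands in for the dataset split (assumption).
rows = [
    {"problem": "p0", "answer": "1"},
    {"problem": "p1", "answer": "2"},
    {"problem": "p2", "answer": "3"},
]

first_two = select_few_shots(rows, 2)
clamped = select_few_shots(rows, 10)
```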

Documentation:
- docs/accuracy/accuracy-benchmarking.md availability table now lists
  ``aime24`` next to ``aime`` with their distinct dataset sources.
- docs/accuracy/accuracy_stubs.md status table moves AIME24Benchmark
  from "Still Stubbed" to "Implemented".

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
diff --git a/docs/accuracy/accuracy-benchmarking.md b/docs/accuracy/accuracy-benchmarking.md
@@ -72,6 +72,7 @@ system message).
 |---|---|---|---|
 | `mmlu` | `multiple_choice` | 5 | `lighteval/mmlu` (57 subjects) |
 | `aime` | `math` | 8 | `Maxwell-Jia/AIME_2024` (trt-llm reference, 8-shot CoT) |
+| `aime24` | `math` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference) |
 
 ## CLI Flags
 
diff --git a/docs/accuracy/accuracy_stubs.md b/docs/accuracy/accuracy_stubs.md
@@ -170,18 +170,18 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
 |---|-------|------|------------|----------------|-----------------|-------|
 | 1 | `MMLUBenchmark` | `benchmarks/mmlu.py` | `mmlu` | `multiple_choice` | 5 | **IMPLEMENTED in PR #815** — canonical reference for new benchmarks. Downloads via HuggingFace datasets, handles few-shot formatting and CoT. |
 | 2 | `AIMEBenchmark` | `benchmarks/aime.py` | `aime` | `math` | 0 | **IMPLEMENTED.** Loads `Maxwell-Jia/AIME_2024`, instructs the model to wrap its final integer in `\boxed{}`, supports few-shot priming and chain-of-thought. |
+| 3 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (lighteval canonical, lowercase schema). Same prompt format as `aime`. |
 
 ### Still Stubbed
 
 | # | Class | File | Plugin Key | Default Grader | Default N-Shots |
 |---|-------|------|------------|----------------|-----------------|
 | 1 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `multiple_choice` | 0 |
 | 2 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 |
-| 3 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
-| 4 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
-| 5 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
-| 6 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 7 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 3 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
+| 4 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
+| 5 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
+| 6 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
 
 **Each benchmark has 1 method to implement:**
 
diff --git a/src/aiperf/accuracy/benchmarks/aime24.py b/src/aiperf/accuracy/benchmarks/aime24.py
@@ -1,21 +1,171 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 
-from aiperf.accuracy.models import BenchmarkProblem
+"""AIME 2024 benchmark loader, ported from lighteval's aime24 task config.
+
+Loads the ``HuggingFaceH4/aime_2024`` dataset (LightEval's canonical
+AIME 2024 mirror, with lowercase ``problem``/``answer`` field names) and
+formats each problem the same way as :mod:`aiperf.accuracy.benchmarks.aime`
+so the prompt + chat construction stays consistent across the AIME family.
+The split between ``aime`` and ``aime24`` is deliberate: ``aime`` is the
+year-agnostic identifier (DeepEval/Maxwell-Jia capitalized schema), while
+``aime24`` pins to lighteval's canonical mirror so users running
+side-by-side comparisons against lighteval get matching prompts.
+
+lighteval reference: lighteval/src/lighteval/tasks/extended/aime/main.py
+"""
+
+from __future__ import annotations
+
+import asyncio
+from typing import Any
+
+from datasets import Dataset, load_dataset
+
+from aiperf.accuracy.benchmarks.aime import (
+    DEFAULT_GENERATION_SIZE,
+    INSTRUCTION_PREFIX,
+)
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
 from aiperf.common.config import UserConfig
 from aiperf.common.mixins import AIPerfLoggerMixin
 
+DATASET_NAME = "HuggingFaceH4/aime_2024"
+TASK_NAME = "aime24"
+
+# Field names in the HuggingFaceH4/aime_2024 schema (lowercase, distinct
+# from the Maxwell-Jia mirror used by AIMEBenchmark).
+PROBLEM_FIELD = "problem"
+ANSWER_FIELD = "answer"
+
 
 class AIME24Benchmark(AIPerfLoggerMixin):
-    """AIME 2024 benchmark loader."""
+    """AIME 2024 benchmark loader (lighteval canonical schema).
+
+    Loads competition problems from ``HuggingFaceH4/aime_2024`` (train
+    split) and produces ``BenchmarkProblem`` objects ready for both the
+    completions endpoint (flat ``prompt``) and the chat endpoint
+    (``raw_messages``). Pairs with ``MathGrader`` for numerical
+    equivalence; instruction prefix and generation size are reused from
+    :mod:`aiperf.accuracy.benchmarks.aime` so the prompt format stays in
+    lockstep across the AIME family.
+    """
 
-    def __init__(self, user_config: UserConfig, **kwargs) -> None:
+    def __init__(self, user_config: UserConfig, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self.user_config = user_config
 
     async def load_problems(
         self, tasks: list[str] | None, n_shots: int, enable_cot: bool
     ) -> list[BenchmarkProblem]:
-        raise NotImplementedError(
-            "aime24 benchmark is not yet implemented; only 'mmlu' is available in this release."
-        )
+        """Load every AIME 2024 problem and format it for the LLM.
+
+        Args:
+            tasks: Ignored — AIME 2024 has no subtasks. Accepted for
+                protocol parity with benchmarks that do filter.
+            n_shots: Number of few-shot examples to prepend (drawn from
+                the start of the dataset). 0 disables few-shot prompting.
+            enable_cot: When True, append ``Let's think step by step.`` to
+                each query.
+
+        Returns:
+            One ``BenchmarkProblem`` per dataset row, in dataset order.
+        """
+        ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="train")
+        return await asyncio.to_thread(self._build_problems, ds, n_shots, enable_cot)
+
+    def _build_problems(
+        self, ds: Dataset, n_shots: int, enable_cot: bool
+    ) -> list[BenchmarkProblem]:
+        few_shots = self._build_few_shots(ds, n_shots)
+        problems: list[BenchmarkProblem] = []
+        for row in ds:
+            prompt = self._format_prompt(row, few_shots, enable_cot)
+            raw_messages = self._build_chat_messages(row, few_shots, enable_cot)
+            problems.append(
+                BenchmarkProblem(
+                    prompt=prompt,
+                    ground_truth=str(row[ANSWER_FIELD]),
+                    task=TASK_NAME,
+                    metadata={"generation_size": DEFAULT_GENERATION_SIZE},
+                    raw_messages=raw_messages,
+                )
+            )
+        return problems
+
+    def _build_few_shots(self, ds: Dataset, n_shots: int) -> list[dict[str, str]]:
+        """Few-shot examples drawn sequentially from the start of the split.
+
+        The HuggingFaceH4 mirror has no separate dev/validation split, so
+        early problems can appear in their own prompts; lighteval makes
+        the same trade-off when no held-out pool is available.
+        """
+        if n_shots <= 0:
+            return []
+        size = min(n_shots, len(ds))
+        return [self._format_example(ds[i]) for i in range(size)]
+
+    def _format_example(self, row: dict[str, Any]) -> dict[str, str]:
+        """Format a dataset row as a few-shot example with ``\\boxed{}``."""
+        answer = str(row[ANSWER_FIELD])
+        problem = row[PROBLEM_FIELD]
+        return {
+            "problem": problem,
+            "answer": answer,
+            "formatted": f"Problem: {problem}\nAnswer: \\boxed{{{answer}}}",
+        }
+
+    def _format_prompt(
+        self,
+        row: dict[str, Any],
+        few_shots: list[dict[str, str]],
+        enable_cot: bool,
+    ) -> str:
+        """Build the flat completions prompt: instruction + shots + query."""
+        few_shot_text = "\n\n".join(ex["formatted"] for ex in few_shots)
+        if few_shot_text:
+            few_shot_text += "\n\n"
+
+        problem = row[PROBLEM_FIELD]
+        if enable_cot:
+            query = f"Problem: {problem}\nLet's think step by step.\nAnswer:"
+        else:
+            query = f"Problem: {problem}\nAnswer:"
+
+        return INSTRUCTION_PREFIX + few_shot_text + query
+
+    def _build_chat_messages(
+        self,
+        row: dict[str, Any],
+        few_shots: list[dict[str, str]],
+        enable_cot: bool,
+    ) -> list[AccuracyChatMessage]:
+        """Build multi-turn chat messages following lighteval's PromptManager.
+
+        Identical structure to :class:`aiperf.accuracy.benchmarks.aime.AIMEBenchmark`:
+        instruction lives on the first user message, assistant primers
+        contain ``\\boxed{answer}``, and the trailing user message has no
+        re-instruction unless there were zero few-shots.
+        """
+        messages: list[AccuracyChatMessage] = []
+
+        for ix, ex in enumerate(few_shots):
+            q = f"Problem: {ex['problem']}\nAnswer:"
+            if ix == 0:
+                q = INSTRUCTION_PREFIX + q
+            messages.append({"role": "user", "content": q})
+            messages.append(
+                {"role": "assistant", "content": f"\\boxed{{{ex['answer']}}}"}
+            )
+
+        problem = row[PROBLEM_FIELD]
+        if enable_cot:
+            main_q = f"Problem: {problem}\nLet's think step by step.\nAnswer:"
+        else:
+            main_q = f"Problem: {problem}\nAnswer:"
+
+        if not few_shots:
+            main_q = INSTRUCTION_PREFIX + main_q
+
+        messages.append({"role": "user", "content": main_q})
+        return messages
diff --git a/tests/unit/accuracy/test_aime24_benchmark.py b/tests/unit/accuracy/test_aime24_benchmark.py