feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) by debermudez · Pull Request #926 · ai-dynamo/aiperf

debermudez · 2026-05-12T23:07:27Z

Plugin: registers aime25 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
Mirrors AIME24's lighteval-backed structure for the 2025 competition year; same grader (LightevalExprGrader, expr_gold_metric) introduced in AIP-874.
Depends on AIP-875 (lighteval sub-stack ordering).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (aime25 task)

Summary by CodeRabbit

New Features
- AIME25 benchmark is now implemented and available for evaluation (default grader updated to a lighteval-based grader, default n‑shots = 0).
Documentation
- Benchmark guides updated with AIME25 configuration, dataset source, and grading details; removed AIME25 from stubbed list.
Tests
- Added unit tests validating problem loading, prompt formatting, answers, and edge cases.
Chores
- Benchmark registry updated to mark AIME25 as implemented.

copy-pr-bot · 2026-05-12T23:07:30Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-12T23:07:41Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e85d67ae69a5294c4db72891ab72f09f9f6a2fe

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e85d67ae69a5294c4db72891ab72f09f9f6a2fe

Last updated for commit: 3e85d67 • Browse code

github-actions · 2026-05-12T23:08:08Z

Fern Docs Preview: https://nvidia-preview-084a4004-3ce9-4c29-8164-14293e7aad29.docs.buildwithfern.com/aiperf/dev

debermudez · 2026-05-12T23:11:27Z

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 5 of 8 — base branch is dbermudez/aip-875-implement-aime24-benchmark-loader,
depends on #925 (AIP-875) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

codecov · 2026-05-13T21:36:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


Implement ``AIME25Benchmark`` mirroring the trt-llm benchmark recipe's ``acc_bench_lighteval.py:aime25`` configuration: same ``aime_prompt_fn`` zero-shot rendering, ``generation_size=32768``, ``hf_repo="yentinglin/aime_2025"``. Same shape as ``AIME24Benchmark``, just pointed at the 2025 mirror.

The loader emits one ``BenchmarkProblem`` per dataset row with the bare problem text as ``prompt``, ``str(answer)`` as ``ground_truth``, and ``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` / ``enable_cot`` are accepted for protocol uniformity but ignored. Pair with ``LightevalExprGrader`` for the recipe's ``expr_gold_metric`` extraction.

Built on top of AIP-875 (lighteval sub-stack ordering: 875 → 876). No heavy optional dependency is needed (``datasets`` is core), so CI gets 100% line + branch coverage out of the box.

Updates the stub registry: drop ``aime25`` from ``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented: false`` from the ``aime25`` plugins.yaml entry, switch ``default_grader`` to ``lighteval_expr``, add the ``aime25`` row to ``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the Status Summary, Method Count Summary, and Suggested Implementation Order accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
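The row-to-problem mapping described in the commit message can be sketched roughly as follows. This is an illustration only, not the code in ``aime25.py``: the ``BenchmarkProblem`` dataclass here is a hypothetical stand-in for aiperf's real type, and the ``problem`` / ``answer`` column names and the ``build_problems`` helper are assumptions made for the example. Only the 32768 generation size, the bare-text prompt, and the ``str(answer)`` ground truth come from the message above.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable

GENERATION_SIZE = 32768  # value stated in the commit message


@dataclass
class BenchmarkProblem:
    """Hypothetical stand-in for aiperf's BenchmarkProblem (fields assumed)."""

    prompt: str
    ground_truth: str
    task: str
    metadata: dict[str, Any] = field(default_factory=dict)


def build_problems(
    rows: Iterable[dict[str, Any]],
    problem_field: str = "problem",  # assumed column name
    answer_field: str = "answer",  # assumed column name
    task: str = "aime25",
) -> list[BenchmarkProblem]:
    """Emit one problem per row: bare problem text, stringified answer."""
    return [
        BenchmarkProblem(
            prompt=row[problem_field],  # no instruction prefix, no few-shot examples
            ground_truth=str(row[answer_field]),  # integer answers become strings
            task=task,
            metadata={"generation_size": GENERATION_SIZE},
        )
        for row in rows
    ]
```

Calling ``build_problems([{"problem": "Find x.", "answer": 70}])`` yields a single problem whose ``ground_truth`` is the string ``"70"``; an empty input yields an empty list, which matches the edge cases the unit tests are said to cover.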

coderabbitai · 2026-05-29T20:54:54Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 54115ac5-d57e-407f-81cf-a41d7b78dee3

📥 Commits

Reviewing files that changed from the base of the PR and between 62d9fbe and 3e85d67.

📒 Files selected for processing (6)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
src/aiperf/accuracy/benchmarks/aime25.py
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_aime25_benchmark.py

💤 Files with no reviewable changes (1)

tests/unit/accuracy/test_accuracy_config.py

✅ Files skipped from review due to trivial changes (2)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md

🚧 Files skipped from review as they are similar to previous changes (3)

src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_aime25_benchmark.py
src/aiperf/accuracy/benchmarks/aime25.py

Walkthrough

Implements the AIME25 benchmark loader (lighteval-aligned) to load yentinglin/aime_2025, build per-row BenchmarkProblems with bare problem-text user messages and stringified answers, and adds plugin, test, and documentation updates reflecting full implementation.

Changes

AIME25 Benchmark Implementation

Layer / File(s)	Summary
Core AIME25 loader implementation `src/aiperf/accuracy/benchmarks/aime25.py`	Replaces NotImplementedError stub with working lighteval loader. Defines AIME25 constants (dataset name, task name, generation size, schema fields) and implements `load_problems()` to asynchronously load `yentinglin/aime_2025` (train split) and `_build_problems()` to construct per-row BenchmarkProblems with bare problem-text user messages, string-valued ground truth, and generation_size metadata.
Plugin configuration wiring `src/aiperf/plugin/plugins.yaml`	Updates `accuracy_benchmark.aime25` entry to configure lighteval-backed grader (`lighteval_expr` replaces `math`), sets default n-shots to 0, updates description text, and removes the `is_implemented: false` flag.
Comprehensive test coverage `tests/unit/accuracy/test_aime25_benchmark.py`	Adds unit tests verifying bare-text prompt generation (no instruction prefix, exactly one user message), invariance to `n_shots` and `enable_cot`, core loader behavior (one problem per row, correct ground_truth and task, generation_size metadata at 32768), and edge cases (empty dataset, Unicode preservation).
Test configuration update `tests/unit/accuracy/test_accuracy_config.py`	Removes `"aime25"` from `STUB_BENCHMARKS` tuple so stub-validation tests no longer expect AIME25 to fail.
Documentation status updates `docs/accuracy/accuracy-benchmarking.md`, `docs/accuracy/accuracy_stubs.md`	Adds `aime25` entry to Available Benchmarks table, moves AIME25 from stubbed to implemented in status summary, updates benchmark tables and implementation counts, and adjusts suggested implementation order.
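The walkthrough says `load_problems()` loads the `yentinglin/aime_2025` train split asynchronously. One common way to do that is to push the blocking dataset call onto a worker thread, sketched below. This is an assumption about the approach, not the PR's actual code: `load_rows` and its injectable `loader` parameter are hypothetical names introduced so the example runs without network access.

```python
import asyncio
from typing import Any, Callable, Iterable

DATASET_NAME = "yentinglin/aime_2025"  # dataset named in the walkthrough
DATASET_SPLIT = "train"  # split named in the walkthrough


async def load_rows(
    loader: Callable[..., Iterable[dict[str, Any]]],
    repo: str = DATASET_NAME,
    split: str = DATASET_SPLIT,
) -> list[dict[str, Any]]:
    """Run a blocking dataset loader in a thread so the event loop stays free."""
    rows = await asyncio.to_thread(loader, repo, split=split)
    return list(rows)


def fake_loader(repo: str, split: str) -> list[dict[str, Any]]:
    """Offline stand-in used only for this example."""
    return [{"problem": f"row from {repo}/{split}", "answer": 1}]


if __name__ == "__main__":
    print(asyncio.run(load_rows(fake_loader)))
```

In a real run, `datasets.load_dataset` could be passed as `loader`, since it accepts the repository path positionally and a `split` keyword. Injecting the loader also keeps unit tests offline, which fits the stated goal of full coverage in CI.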

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 A benchmark once stubbed, now hops with cheer,
AIME25 leaps from the problem frontier,
Lighteval grading, bare text so bare,
With tests that validate beyond compare!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: implementing an AIME 2025 benchmark with lighteval backing, references the ticket ID, and accurately reflects the changeset's primary objective.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/aiperf/accuracy/benchmarks/aime25.py (1)
49-49: ⚡ Quick win

Remove type annotation from **kwargs.

The **kwargs: Any annotation should be removed. Based on learnings, variadic keyword arguments should remain untyped unless explicit named parameters are needed.
♻️ Proposed fix
-    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs) -> None:
Based on learnings: "In Python projects (e.g., in aiperf), avoid adding type annotations to **kwargs like **kwargs: Any. The variadic keyword arguments are inherently dynamic; leave **kwargs untyped or replace with explicit, named keyword parameters if a concrete contract is needed."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/aime25.py` at line 49, The __init__ signature
for the class in aime25.py annotates variadic keywords as **kwargs: Any; remove
the type annotation and change the signature to use untyped **kwargs so it reads
def __init__(self, run: BenchmarkRun, **kwargs) -> None:, updating any
references to __init__ if they assert the typed form and ensuring no additional
named keyword parameters are required; this removes the unnecessary **kwargs:
Any annotation while preserving behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/accuracy/accuracy_stubs.md`:
- Line 328: The docs currently give conflicting guidance for math_500 (which
mirrors AIME24Benchmark) by recommending pairing with lighteval_latex while
earlier listing math_500's default grader as math; update the accuracy_stubs
entry for math_500 to explicitly state that pairing with lighteval_latex is a
planned post-implementation transition (or else change the recommended grader to
match current default 'math') so contributors have a single source of truth;
reference the symbol math_500 and the grader names lighteval_latex and math when
making this clarification.

---

Nitpick comments:
In `@src/aiperf/accuracy/benchmarks/aime25.py`:
- Line 49: The __init__ signature for the class in aime25.py annotates variadic
keywords as **kwargs: Any; remove the type annotation and change the signature
to use untyped **kwargs so it reads def __init__(self, run: BenchmarkRun,
**kwargs) -> None:, updating any references to __init__ if they assert the typed
form and ensuring no additional named keyword parameters are required; this
removes the unnecessary **kwargs: Any annotation while preserving behavior.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0676f619-de8e-4ffa-acb4-481290ad2309

📥 Commits

Reviewing files that changed from the base of the PR and between a90d154 and 62d9fbe.

📒 Files selected for processing (6)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
src/aiperf/accuracy/benchmarks/aime25.py
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_aime25_benchmark.py

💤 Files with no reviewable changes (1)

tests/unit/accuracy/test_accuracy_config.py

github-actions Bot added the feat label May 12, 2026

This was referenced May 12, 2026

feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927

Draft

feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928

Draft

feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

Draft

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from e0576be to cd239a5 Compare May 12, 2026 23:24

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 358d5bd to 9bbe752 Compare May 12, 2026 23:24

debermudez marked this pull request as draft May 12, 2026 23:27

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from cd239a5 to ed0edf6 Compare May 13, 2026 00:37

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 9bbe752 to 599d8f4 Compare May 13, 2026 00:39

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from ed0edf6 to 9eabc25 Compare May 13, 2026 21:22

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 599d8f4 to d7552b6 Compare May 13, 2026 21:23

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from d7552b6 to b8645f1 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 9eabc25 to 05ece47 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 05ece47 to 80e91dc Compare May 27, 2026 19:41

Base automatically changed from dbermudez/aip-875-implement-aime24-benchmark-loader to main May 29, 2026 00:25

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from b8645f1 to 62d9fbe Compare May 29, 2026 20:48

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 62d9fbe to 3e85d67 Compare May 29, 2026 20:53

coderabbitai Bot reviewed May 29, 2026


Comment thread docs/accuracy/accuracy_stubs.md

debermudez marked this pull request as ready for review May 29, 2026 23:03

dynamo-ops approved these changes May 29, 2026


FrankD412 approved these changes May 29, 2026


FrankD412 merged commit 1070bf2 into main May 29, 2026
26 checks passed

FrankD412 deleted the dbermudez/aip-876-implement-aime25-benchmark-loader branch May 29, 2026 23:18

FrankD412 merged 1 commit into main from dbermudez/aip-876-implement-aime25-benchmark-loader


3 participants
