Skip to content

feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876)#926

Merged
FrankD412 merged 1 commit into
mainfrom
dbermudez/aip-876-implement-aime25-benchmark-loader
May 29, 2026
Merged

feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876)#926
FrankD412 merged 1 commit into
mainfrom
dbermudez/aip-876-implement-aime25-benchmark-loader

Conversation

@debermudez
Copy link
Copy Markdown
Contributor

@debermudez debermudez commented May 12, 2026

  • Plugin: registers aime25 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
  • Mirrors AIME24's lighteval-backed structure for the 2025 competition year; same grader (LightevalExprGrader, expr_gold_metric) introduced in AIP-874.
  • Depends on AIP-875 (lighteval sub-stack ordering).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (aime25 task)

Summary by CodeRabbit

  • New Features

    • AIME25 benchmark is now implemented and available for evaluation (default grader updated to a lighteval-based grader, default n‑shots = 0).
  • Documentation

    • Benchmark guides updated with AIME25 configuration, dataset source, and grading details; removed AIME25 from stubbed list.
  • Tests

    • Added unit tests validating problem loading, prompt formatting, answers, and edge cases.
  • Chores

    • Benchmark registry updated to mark AIME25 as implemented.

Review Change Stack

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 12, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e85d67ae69a5294c4db72891ab72f09f9f6a2fe

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e85d67ae69a5294c4db72891ab72f09f9f6a2fe

Last updated for commit: 3e85d67Browse code

@github-actions github-actions Bot added the feat label May 12, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 12, 2026

@debermudez
Copy link
Copy Markdown
Contributor Author

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

  1. AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
  2. AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
  3. AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
  4. AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
  5. AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
  6. AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
  7. AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
  8. AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 5 of 8 — base branch is dbermudez/aip-875-implement-aime24-benchmark-loader,
depends on #925 (AIP-875) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from e0576be to cd239a5 Compare May 12, 2026 23:24
@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 358d5bd to 9bbe752 Compare May 12, 2026 23:24
@debermudez debermudez marked this pull request as draft May 12, 2026 23:27
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from cd239a5 to ed0edf6 Compare May 13, 2026 00:37
@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 9bbe752 to 599d8f4 Compare May 13, 2026 00:39
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from ed0edf6 to 9eabc25 Compare May 13, 2026 21:22
@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 599d8f4 to d7552b6 Compare May 13, 2026 21:23
@codecov
Copy link
Copy Markdown

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from d7552b6 to b8645f1 Compare May 13, 2026 23:32
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 9eabc25 to 05ece47 Compare May 13, 2026 23:32
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 05ece47 to 80e91dc Compare May 27, 2026 19:41
Base automatically changed from dbermudez/aip-875-implement-aime24-benchmark-loader to main May 29, 2026 00:25
@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from b8645f1 to 62d9fbe Compare May 29, 2026 20:48
Implement ``AIME25Benchmark`` mirroring the trt-llm benchmark recipe's
``acc_bench_lighteval.py:aime25`` configuration: same
``aime_prompt_fn`` zero-shot rendering, ``generation_size=32768``,
``hf_repo="yentinglin/aime_2025"``. Same shape as ``AIME24Benchmark``
just pointed at the 2025 mirror.

The loader emits one ``BenchmarkProblem`` per dataset row with the
bare problem text as ``prompt``, ``str(answer)`` as ``ground_truth``,
and ``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` /
``enable_cot`` are accepted for protocol uniformity but ignored.
Pair with ``LightevalExprGrader`` for the recipe's
``expr_gold_metric`` extraction.

Built on top of AIP-875 (lighteval sub-stack ordering: 875 → 876).
No heavy optional dependency — ``datasets`` is core — so CI gets
100% line + branch coverage out of the box.

Updates the stub registry: drop ``aime25`` from
``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented:
false`` from the ``aime25`` plugins.yaml entry, switch
``default_grader`` to ``lighteval_expr``, add the ``aime25`` row to
``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still
Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the
Status Summary, Method Count Summary, and Suggested Implementation
Order accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 62d9fbe to 3e85d67 Compare May 29, 2026 20:53
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 54115ac5-d57e-407f-81cf-a41d7b78dee3

📥 Commits

Reviewing files that changed from the base of the PR and between 62d9fbe and 3e85d67.

📒 Files selected for processing (6)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
  • src/aiperf/accuracy/benchmarks/aime25.py
  • src/aiperf/plugin/plugins.yaml
  • tests/unit/accuracy/test_accuracy_config.py
  • tests/unit/accuracy/test_aime25_benchmark.py
💤 Files with no reviewable changes (1)
  • tests/unit/accuracy/test_accuracy_config.py
✅ Files skipped from review due to trivial changes (2)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/aiperf/plugin/plugins.yaml
  • tests/unit/accuracy/test_aime25_benchmark.py
  • src/aiperf/accuracy/benchmarks/aime25.py

Walkthrough

Implements the AIME25 benchmark loader (lighteval-aligned) to load yentinglin/aime_2025, build per-row BenchmarkProblems with bare problem-text user messages and stringified answers, and adds plugin, test, and documentation updates reflecting full implementation.

Changes

AIME25 Benchmark Implementation

Layer / File(s) Summary
Core AIME25 loader implementation
src/aiperf/accuracy/benchmarks/aime25.py
Replaces NotImplementedError stub with working lighteval loader. Defines AIME25 constants (dataset name, task name, generation size, schema fields) and implements load_problems() to asynchronously load yentinglin/aime_2025 (train split) and _build_problems() to construct per-row BenchmarkProblems with bare problem-text user messages, string-valued ground truth, and generation_size metadata.
Plugin configuration wiring
src/aiperf/plugin/plugins.yaml
Updates accuracy_benchmark.aime25 entry to configure lighteval-backed grader (lighteval_expr replaces math), sets default n-shots to 0, updates description text, and removes the is_implemented: false flag.
Comprehensive test coverage
tests/unit/accuracy/test_aime25_benchmark.py
Adds unit tests verifying bare-text prompt generation (no instruction prefix, exactly one user message), invariance to n_shots and enable_cot, core loader behavior (one problem per row, correct ground_truth and task, generation_size metadata at 32768), and edge cases (empty dataset, Unicode preservation).
Test configuration update
tests/unit/accuracy/test_accuracy_config.py
Removes \"aime25\" from STUB_BENCHMARKS tuple so stub-validation tests no longer expect AIME25 to fail.
Documentation status updates
docs/accuracy/accuracy-benchmarking.md, docs/accuracy/accuracy_stubs.md
Adds aime25 entry to Available Benchmarks table, moves AIME25 from stubbed to implemented in status summary, updates benchmark tables and implementation counts, and adjusts suggested implementation order.

🎯 3 (Moderate) | ⏱️ ~25 minutes


🐰 A benchmark once stubbed, now hops with cheer,
AIME25 leaps from the problem frontier,
Lighteval grading, bare text so bare,
With tests that validate beyond compare!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: implementing an AIME 2025 benchmark with lighteval backing, references the ticket ID, and accurately reflects the changeset's primary objective.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/aiperf/accuracy/benchmarks/aime25.py (1)

49-49: ⚡ Quick win

Remove type annotation from **kwargs.

The **kwargs: Any annotation should be removed. Based on learnings, variadic keyword arguments should remain untyped unless explicit named parameters are needed.

♻️ Proposed fix
-    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs) -> None:

Based on learnings: "In Python projects (e.g., in aiperf), avoid adding type annotations to **kwargs like **kwargs: Any. The variadic keyword arguments are inherently dynamic; leave **kwargs untyped or replace with explicit, named keyword parameters if a concrete contract is needed."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/aime25.py` at line 49, The __init__ signature
for the class in aime25.py annotates variadic keywords as **kwargs: Any; remove
the type annotation and change the signature to use untyped **kwargs so it reads
def __init__(self, run: BenchmarkRun, **kwargs) -> None:, updating any
references to __init__ if they assert the typed form and ensuring no additional
named keyword parameters are required; this removes the unnecessary **kwargs:
Any annotation while preserving behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/accuracy/accuracy_stubs.md`:
- Line 328: The docs currently give conflicting guidance for math_500 (which
mirrors AIME24Benchmark) by recommending pairing with lighteval_latex while
earlier listing math_500's default grader as math; update the accuracy_stubs
entry for math_500 to explicitly state that pairing with lighteval_latex is a
planned post-implementation transition (or else change the recommended grader to
match current default 'math') so contributors have a single source of truth;
reference the symbol math_500 and the grader names lighteval_latex and math when
making this clarification.

---

Nitpick comments:
In `@src/aiperf/accuracy/benchmarks/aime25.py`:
- Line 49: The __init__ signature for the class in aime25.py annotates variadic
keywords as **kwargs: Any; remove the type annotation and change the signature
to use untyped **kwargs so it reads def __init__(self, run: BenchmarkRun,
**kwargs) -> None:, updating any references to __init__ if they assert the typed
form and ensuring no additional named keyword parameters are required; this
removes the unnecessary **kwargs: Any annotation while preserving behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0676f619-de8e-4ffa-acb4-481290ad2309

📥 Commits

Reviewing files that changed from the base of the PR and between a90d154 and 62d9fbe.

📒 Files selected for processing (6)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
  • src/aiperf/accuracy/benchmarks/aime25.py
  • src/aiperf/plugin/plugins.yaml
  • tests/unit/accuracy/test_accuracy_config.py
  • tests/unit/accuracy/test_aime25_benchmark.py
💤 Files with no reviewable changes (1)
  • tests/unit/accuracy/test_accuracy_config.py

Comment thread docs/accuracy/accuracy_stubs.md
@debermudez debermudez marked this pull request as ready for review May 29, 2026 23:03
@FrankD412 FrankD412 merged commit 1070bf2 into main May 29, 2026
26 checks passed
@FrankD412 FrankD412 deleted the dbermudez/aip-876-implement-aime25-benchmark-loader branch May 29, 2026 23:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants