Skip to content

feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876)#926

Draft
debermudez wants to merge 1 commit into
dbermudez/aip-875-implement-aime24-benchmark-loaderfrom
dbermudez/aip-876-implement-aime25-benchmark-loader
Draft

feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876)#926
debermudez wants to merge 1 commit into
dbermudez/aip-875-implement-aime24-benchmark-loaderfrom
dbermudez/aip-876-implement-aime25-benchmark-loader

Conversation

@debermudez
Copy link
Copy Markdown
Contributor

  • Plugin: registers aime25 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
  • Mirrors AIME24's lighteval-backed structure for the 2025 competition year; same grader (LightevalExprGrader, expr_gold_metric) introduced in AIP-874.
  • Depends on AIP-875 (lighteval sub-stack ordering).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (aime25 task)

…-up)

Same lighteval alignment as AIME24, but with the yentinglin/aime_2025
dataset. Bare problem text as user message, generation_size=32768,
default_grader=lighteval_expr. Tests pin the same invariants
(prompt is bare problem text, n_shots/enable_cot ignored).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:142

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 12, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b8645f16a6caf24e8eb897c46d2bc23a9284a60e

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b8645f16a6caf24e8eb897c46d2bc23a9284a60e

Last updated for commit: b8645f1Browse code

@github-actions github-actions Bot added the feat label May 12, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 12, 2026

@debermudez
Copy link
Copy Markdown
Contributor Author

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

  1. AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
  2. AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
  3. AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
  4. AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
  5. AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
  6. AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
  7. AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
  8. AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 5 of 8 — base branch is dbermudez/aip-875-implement-aime24-benchmark-loader,
depends on #925 (AIP-875) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from e0576be to cd239a5 Compare May 12, 2026 23:24
@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 358d5bd to 9bbe752 Compare May 12, 2026 23:24
@debermudez debermudez marked this pull request as draft May 12, 2026 23:27
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from cd239a5 to ed0edf6 Compare May 13, 2026 00:37
@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 9bbe752 to 599d8f4 Compare May 13, 2026 00:39
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from ed0edf6 to 9eabc25 Compare May 13, 2026 21:22
@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 599d8f4 to d7552b6 Compare May 13, 2026 21:23
@codecov
Copy link
Copy Markdown

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@debermudez debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from d7552b6 to b8645f1 Compare May 13, 2026 23:32
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 9eabc25 to 05ece47 Compare May 13, 2026 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant