feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) by debermudez · Pull Request #926 · ai-dynamo/aiperf

debermudez · 2026-05-12T23:07:27Z

Plugin: registers aime25 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
Mirrors AIME24's lighteval-backed structure for the 2025 competition year; same grader (LightevalExprGrader, expr_gold_metric) introduced in AIP-874.
Depends on AIP-875 (lighteval sub-stack ordering).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (aime25 task)

…-up) Same lighteval alignment as AIME24, but with the yentinglin/aime_2025 dataset. Bare problem text as user message, generation_size=32768, default_grader=lighteval_expr. Tests pin the same invariants (prompt is bare problem text, n_shots/enable_cot ignored). Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:142 Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

copy-pr-bot · 2026-05-12T23:07:30Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-12T23:07:41Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b8645f16a6caf24e8eb897c46d2bc23a9284a60e

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b8645f16a6caf24e8eb897c46d2bc23a9284a60e

Last updated for commit: b8645f1 • Browse code

github-actions · 2026-05-12T23:08:08Z

Fern Docs Preview: https://nvidia-preview-a5c9c167-c63b-495d-8ddb-b366c3993d11.docs.buildwithfern.com/aiperf/dev

debermudez · 2026-05-12T23:11:27Z

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 5 of 8 — base branch is dbermudez/aip-875-implement-aime24-benchmark-loader,
depends on #925 (AIP-875) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

codecov · 2026-05-13T21:36:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

github-actions Bot added the feat label May 12, 2026

This was referenced May 12, 2026

feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927

Draft

feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928

Draft

feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

Draft

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from e0576be to cd239a5 Compare May 12, 2026 23:24

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 358d5bd to 9bbe752 Compare May 12, 2026 23:24

debermudez marked this pull request as draft May 12, 2026 23:27

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from cd239a5 to ed0edf6 Compare May 13, 2026 00:37

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 9bbe752 to 599d8f4 Compare May 13, 2026 00:39

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from ed0edf6 to 9eabc25 Compare May 13, 2026 21:22

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 599d8f4 to d7552b6 Compare May 13, 2026 21:23

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from d7552b6 to b8645f1 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 9eabc25 to 05ece47 Compare May 13, 2026 23:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876)#926

feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876)#926
debermudez wants to merge 1 commit into
dbermudez/aip-875-implement-aime24-benchmark-loaderfrom
dbermudez/aip-876-implement-aime25-benchmark-loader

debermudez commented May 12, 2026

Uh oh!

copy-pr-bot Bot commented May 12, 2026

Uh oh!

github-actions Bot commented May 12, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 12, 2026 •

edited

Loading

Uh oh!

debermudez commented May 12, 2026

Uh oh!

codecov Bot commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

debermudez commented May 12, 2026

Uh oh!

copy-pr-bot Bot commented May 12, 2026

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

debermudez commented May 12, 2026

Stack dependency

Uh oh!

codecov Bot commented May 13, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 12, 2026 •

edited

Loading

github-actions Bot commented May 12, 2026 •

edited

Loading