feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876)#926
Draft
debermudez wants to merge 1 commit into
Conversation
…-up) Same lighteval alignment as AIME24, but with the yentinglin/aime_2025 dataset. Bare problem text as user message, generation_size=32768, default_grader=lighteval_expr. Tests pin the same invariants (prompt is bare problem text, n_shots/enable_cot ignored). Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:142 Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b8645f16a6caf24e8eb897c46d2bc23a9284a60eRecommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b8645f16a6caf24e8eb897c46d2bc23a9284a60eLast updated for commit: |
This was referenced May 12, 2026
Contributor
Author
Stack dependencyThis PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with Merge order:
This PR: position 5 of 8 — base branch is After each upstream PR merges, the downstream PR's branch will be rebased |
e0576be to
cd239a5
Compare
358d5bd to
9bbe752
Compare
cd239a5 to
ed0edf6
Compare
9bbe752 to
599d8f4
Compare
ed0edf6 to
9eabc25
Compare
599d8f4 to
d7552b6
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
d7552b6 to
b8645f1
Compare
9eabc25 to
05ece47
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
aime25benchmark inplugins.yaml(default_grader: math,default_n_shots: 0); scaffold loader raisesNotImplementedErroruntil the full lighteval-backed implementation lands.LightevalExprGrader,expr_gold_metric) introduced in AIP-874.Reference:
trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py(aime25task)