feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) by debermudez · Pull Request #927 · ai-dynamo/aiperf

debermudez · 2026-05-12T23:07:41Z

Plugin: registers math_500 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
500 curated math problems spanning algebra, geometry, number theory, and combinatorics; gold answers are LaTeX snippets (e.g. \frac{1}{3}, \sqrt{2}).
Uses LightevalLatexGrader (lighteval_latex, latex_gold_metric, LatexExtractionConfig on the gold side) introduced in AIP-874.
Depends on AIP-876 (lighteval sub-stack ordering).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (math_500 task)

…ow-up) Switches MATH-500 to lighteval-aligned grading per the trt-llm benchmark recipe's acc_bench_lighteval.py:math_500. Loader emits the bare problem text as the user message; ground_truth is the full solution (containing the boxed answer); LightevalLatexGrader extracts the boxed expression at grade time. Per-row `subject` becomes the task name so the accuracy CSV breaks down by MATH subject. plugins.yaml: `math_500.default_grader`: math → lighteval_latex generation_size: 32768. Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:156 Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

copy-pr-bot · 2026-05-12T23:07:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-12T23:07:52Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c2ac1d4697137da720a6a4c9dba690767a0e39e7

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c2ac1d4697137da720a6a4c9dba690767a0e39e7

Last updated for commit: c2ac1d4 • Browse code

github-actions · 2026-05-12T23:08:24Z

Fern Docs Preview: https://nvidia-preview-221664ca-af9b-4b14-8ed3-be7826d0cf17.docs.buildwithfern.com/aiperf/dev

debermudez · 2026-05-12T23:11:33Z

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 6 of 8 — base branch is dbermudez/aip-876-implement-aime25-benchmark-loader,
depends on #926 (AIP-876) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

codecov · 2026-05-13T21:36:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

github-actions Bot added the feat label May 12, 2026

This was referenced May 12, 2026

feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928

Draft

feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

Draft

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 358d5bd to 9bbe752 Compare May 12, 2026 23:25

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from e4b5d58 to 28c1485 Compare May 12, 2026 23:25

debermudez marked this pull request as draft May 12, 2026 23:27

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 9bbe752 to 599d8f4 Compare May 13, 2026 00:39

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 28c1485 to 9e0491c Compare May 13, 2026 00:39

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 599d8f4 to d7552b6 Compare May 13, 2026 21:23

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 9e0491c to 3770e19 Compare May 13, 2026 21:23

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from d7552b6 to b8645f1 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 3770e19 to c2ac1d4 Compare May 13, 2026 23:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879)#927

feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879)#927
debermudez wants to merge 1 commit into
dbermudez/aip-876-implement-aime25-benchmark-loaderfrom
dbermudez/aip-879-implement-math500-benchmark-loader

debermudez commented May 12, 2026

Uh oh!

copy-pr-bot Bot commented May 12, 2026

Uh oh!

github-actions Bot commented May 12, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 12, 2026 •

edited

Loading

Uh oh!

debermudez commented May 12, 2026

Uh oh!

codecov Bot commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

debermudez commented May 12, 2026

Uh oh!

copy-pr-bot Bot commented May 12, 2026

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

debermudez commented May 12, 2026

Stack dependency

Uh oh!

codecov Bot commented May 13, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 12, 2026 •

edited

Loading

github-actions Bot commented May 12, 2026 •

edited

Loading