feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927

Draft
debermudez wants to merge 1 commit into dbermudez/aip-876-implement-aime25-benchmark-loader from dbermudez/aip-879-implement-math500-benchmark-loader

@debermudez
Contributor

  • Plugin: registers math_500 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
  • 500 curated math problems spanning algebra, geometry, number theory, and combinatorics; gold answers are LaTeX snippets (e.g. \frac{1}{3}, \sqrt{2}).
  • Uses LightevalLatexGrader (lighteval_latex, latex_gold_metric, LatexExtractionConfig on the gold side) introduced in AIP-874.
  • Depends on AIP-876 (lighteval sub-stack ordering).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (math_500 task)

…ow-up)

Switches MATH-500 to lighteval-aligned grading per the trt-llm
benchmark recipe's acc_bench_lighteval.py:math_500. Loader emits the
bare problem text as the user message; ground_truth is the full
solution (containing the boxed answer); LightevalLatexGrader extracts
the boxed expression at grade time. Per-row `subject` becomes the
task name so the accuracy CSV breaks down by MATH subject.

plugins.yaml: `math_500.default_grader`: math → lighteval_latex
generation_size: 32768.

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:156

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
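
The loader behavior described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the aiperf loader or the LightevalLatexGrader implementation: the `Row` type, `to_row`, `last_boxed`, and `accuracy_by_task` are hypothetical names, and the record keys `problem`, `solution`, and `subject` are assumed. The real grader compares LaTeX expressions through lighteval; the brace-matching extraction here only shows what "extracts the boxed expression" means.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """Hypothetical per-problem record; field names are illustrative."""
    task: str          # MATH subject, used as the task name
    user_message: str  # bare problem text, no prompt template
    ground_truth: str  # full reference solution, boxed answer included


def to_row(record: dict) -> Row:
    # Assumes each dataset record has "problem", "solution", and "subject".
    return Row(
        task=record["subject"],
        user_message=record["problem"],
        ground_truth=record["solution"],
    )


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    for pos in range(begin, len(text)):
        if text[pos] == "{":
            depth += 1
        elif text[pos] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:pos]
    return None  # unbalanced braces


def accuracy_by_task(results) -> dict:
    """Map (task, is_correct) pairs to {task: fraction correct}."""
    totals = defaultdict(lambda: [0, 0])  # task -> [correct, seen]
    for task, ok in results:
        totals[task][0] += int(ok)
        totals[task][1] += 1
    return {task: correct / seen for task, (correct, seen) in totals.items()}


record = {
    "problem": "What is $\\frac{1}{2} - \\frac{1}{6}$?",
    "solution": "We have $\\frac{3}{6} - \\frac{1}{6} = \\boxed{\\frac{1}{3}}$.",
    "subject": "Prealgebra",
}
row = to_row(record)
print(row.task, last_boxed(row.ground_truth))  # Prealgebra \frac{1}{3}
```

Keeping the full solution as ground_truth and deferring extraction to grade time means the loader does not need its own answer parser; grouping results by `task` is what would let a per-subject breakdown fall out of the accuracy CSV.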
@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot added the feat label on May 12, 2026
@github-actions

github-actions Bot commented May 12, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c2ac1d4697137da720a6a4c9dba690767a0e39e7

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c2ac1d4697137da720a6a4c9dba690767a0e39e7

Last updated for commit: c2ac1d4

@debermudez
Contributor Author

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

  1. AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
  2. AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
  3. AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
  4. AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
  5. AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
  6. AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
  7. AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
  8. AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 6 of 8 — base branch is dbermudez/aip-876-implement-aime25-benchmark-loader,
depends on #926 (AIP-876) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

@debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 358d5bd to 9bbe752 on May 12, 2026 23:25
@debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from e4b5d58 to 28c1485 on May 12, 2026 23:25
@debermudez marked this pull request as draft on May 12, 2026 23:27
@debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 9bbe752 to 599d8f4 on May 13, 2026 00:39
@debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 28c1485 to 9e0491c on May 13, 2026 00:39
@debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 599d8f4 to d7552b6 on May 13, 2026 21:23
@debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 9e0491c to 3770e19 on May 13, 2026 21:23
@codecov

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from d7552b6 to b8645f1 on May 13, 2026 23:32
@debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 3770e19 to c2ac1d4 on May 13, 2026 23:32