|
7 | 7 |
|
8 | 8 | This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected. |
9 | 9 |
|
10 | | -**Status summary:** With the BigBench-Hard loader landing on top of the HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, and `BigBenchBenchmark` are fully implemented; the remaining benchmarks (`aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs. |
| 10 | +**Status summary:** With the AIME24 loader landing on top of the BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, and `AIME24Benchmark` are fully implemented; the remaining benchmarks (`aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs. |
11 | 11 |
|
12 | 12 | ## Table of Contents |
13 | 13 |
|
@@ -173,16 +173,16 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method. |
173 | 173 | | 2 | `AIMEBenchmark` | `benchmarks/aime.py` | `aime` | `math` | 8 | **IMPLEMENTED.** Loads `Maxwell-Jia/AIME_2024`, instructs the model to wrap its final integer in `\boxed{}`, supports few-shot priming and chain-of-thought. `default_enable_cot=true`. | |
174 | 174 | | 3 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `exact_match` | 10 | **IMPLEMENTED.** Loads `Rowan/hellaswag` (validation split filtered per task by `activity_label`; train split feeds the "one few-shot per unique activity_label" rule). Prompt rendering delegates to `deepeval.benchmarks.HellaSwag`'s `HellaSwagTemplate.generate_output`, so output is byte-equal to the trt-llm recipe's DeepEval-backed path. Pairs with `exact_match` for strict `Scorer.exact_match_score` semantics. Requires the `[accuracy]` extras (deepeval). | |
175 | 175 | | 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). | |
| 176 | +| 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. | |
176 | 177 |
|
177 | 178 | ### Still Stubbed |
178 | 179 |
|
179 | 180 | | # | Class | File | Plugin Key | Default Grader | Default N-Shots | |
180 | 181 | |---|-------|------|------------|----------------|-----------------| |
181 | | -| 1 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 | |
182 | | -| 2 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 | |
183 | | -| 3 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 | |
184 | | -| 4 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 | |
185 | | -| 5 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 | |
| 182 | +| 1 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 | |
| 183 | +| 2 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 | |
| 184 | +| 3 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 | |
| 185 | +| 4 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 | |
186 | 186 |
|
187 | 187 | **Each benchmark has 1 method to implement:** |
188 | 188 |
|
@@ -309,24 +309,24 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu |
309 | 309 | | Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods | |
310 | 310 | |-----------|-------------|---------------|------------------|-------------------| |
311 | 311 | | Graders | 7 (all) | 0 | — | 0 | |
312 | | -| Benchmarks | 4 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) | 5 | 1 (`load_problems`) | 5 | |
| 312 | +| Benchmarks | 5 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) | 4 | 1 (`load_problems`) | 4 | |
313 | 313 | | Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 | |
314 | 314 | | Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 | |
315 | 315 | | Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 | |
316 | 316 | | Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 | |
317 | 317 | | Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 | |
318 | | -| **Total** | **15** | **6** | | **6** | |
| 318 | +| **Total** | **16** | **5** | | **5** | |
319 | 319 |
|
320 | 320 | ### Self-Disabling Pattern |
321 | 321 |
|
322 | 322 | Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`. |
323 | 323 |
|
324 | 324 | ### Suggested Implementation Order |
325 | 325 |
|
326 | | -The processors, exporters, all graders, and four benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) are already wired end-to-end. The remaining work is the five stub benchmarks; mirror the existing loader whose grader matches: |
| 326 | +The processors, exporters, all graders, and five benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) are already wired end-to-end. The remaining work is the four stub benchmarks; mirror the existing loader whose grader matches: |
327 | 327 |
|
328 | | -1. **`aime24`, `aime25`, `math_500`** — mirror `AIMEBenchmark` (`benchmarks/aime.py`); pair with the `math` grader. |
329 | | -2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `multiple_choice` grader. |
| 328 | +1. **`aime25`, `math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_expr` (aime25) or `lighteval_latex` (math_500). |
| 329 | +2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader. |
330 | 330 | 3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader. |
331 | 331 | 4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported. |
332 | 332 |
|
|
0 commit comments