Add APEX Shortlist benchmark#1105
Merged
cmunley1 merged 14 commits into NVIDIA-NeMo:main on Apr 25, 2026
Conversation
Migrates NeMo Skills' `apex-shortlist` benchmark into Gym. Reuses the `math_with_judge` resource server in symbolic-only mode (`should_use_judge: false`) to mirror Skills' `eval_type=math` default. The prompt is a character-for-character copy of Skills' `generic/math.yaml`. Data source: `MathArena/apex-shortlist` on HuggingFace (48 problems, a mix of integer and symbolic answers).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
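For context, the dataset can be pulled straight from the Hub. A minimal sketch (the split name and printed fields are assumptions, not necessarily what prepare.py does):

```python
# Illustrative only: fetch the APEX Shortlist problems from HuggingFace.
# The "train" split and the printed fields are assumptions, not Gym's
# actual prepare.py logic.
from datasets import load_dataset

ds = load_dataset("MathArena/apex-shortlist", split="train")
print(len(ds))   # expected to be 48 problems
print(ds[0])     # one problem record with its statement and reference answer
```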
…s judge input
math-verify is tuned for numeric/algebraic answers; it silently mangles
non-numeric \boxed content (function definitions, sets, conditions) into
a degenerate fragment during extraction. For example:
model wrote: \boxed{g(x)=2x^{3}+C \text{ or } g(x)=-2x^{3}+C, C\in\mathbb{R}}
math-verify extracted: "2"
When the symbolic-first cascade falls through to the LLM judge on such
cases, the judge saw "2" and (correctly) rejected it, even though the
model's actual answer was the full function definition.
The fix: when sending the answer to the judge, prefer the raw LaTeX
inside the model's last balanced \boxed{...} (or \fbox{...}) over
math-verify's normalized form. Fall back to math-verify's extraction
when no balanced \boxed is present, then to the full generation — the
prior behavior — if neither is available.
Implementation is a small, local change to _verify_answer and a new
brace-matching _search_boxed helper that mirrors
nemo_skills.evaluation.math_grader.search_boxed so Gym can pass the
same LaTeX a Skills baseline would pass to its judge.
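For illustration, a minimal sketch of the brace-matching extraction and the judge-input preference described above (not the actual `_search_boxed` / `_verify_answer` code; `\fbox` handling and other edge cases are omitted):

```python
# Sketch of a last-balanced-\boxed{...} extractor plus the fallback chain
# used to pick the judge input. Hypothetical stand-ins for _search_boxed
# and the _verify_answer change described in this commit.
def search_boxed(text: str) -> str | None:
    marker = r"\boxed{"
    start = text.rfind(marker)            # last \boxed{...} wins
    if start == -1:
        return None
    i = start + len(marker)
    begin, depth = i, 1
    while i < len(text) and depth:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    return None if depth else text[begin : i - 1]   # raw LaTeX inside the braces


def pick_judge_input(generation: str, math_verify_extraction: str | None) -> str:
    # Prefer the model's own raw \boxed content; fall back to math-verify's
    # normalized extraction, then to the full generation (prior behavior).
    return search_boxed(generation) or math_verify_extraction or generation


full = r"... so \boxed{g(x)=2x^{3}+C \text{ or } g(x)=-2x^{3}+C, C\in\mathbb{R}}"
assert search_boxed(full).startswith("g(x)=2x^{3}+C")
```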
Does NOT change:
- extracted_answer field (still math-verify's output, used for
majority-vote grouping and the symbolic_accuracy diagnostic)
- library_reward / symbolic_accuracy semantics
- behavior when symbolic passes (judge remains skipped as before)
Discovered on imo-answerbench parity comparison: 107/1600 rollouts
fell into this bucket and were flipped by the judge on Gym but judged
correctly on Skills, driving a 5pp pass@1 gap that was not a migration
bug.
Tests: 9 new (TestSearchBoxed, TestRawBoxedJudgeInput) + 21 existing
still pass.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
New config flag (default False) that mirrors NeMo Skills' judge pipeline
end-to-end, so Gym's per-rollout reward matches Skills' `judge_correct`
verdict 1:1 — not just the aggregate average.
When skills_parity_mode=True:
1. Strip <think>...</think> / <thinking>...</thinking> from the raw
model output (mirrors Skills' parse_reasoning=True). Also strips
pre-</think> content when only a closing tag survives, which is
common when the reasoning parser eats the opening tag.
2. Extract the last balanced \boxed{...} content via the
_search_boxed brace-matcher (not math-verify). The raw LaTeX the
model actually wrote becomes `extracted_answer`, matching Skills'
`predicted_answer` field used for majority@k grouping.
3. Prefill shortcuts, mirroring nemo_skills.utils.prefill_judgement:
empty extraction -> reward 0, no LLM call
extracted.strip() == expected.strip() -> reward 1, no LLM call
4. Otherwise call the LLM judge unconditionally. Symbolic success does
NOT short-circuit the judge in this mode — Skills' metric doesn't
either.
5. `reward` == final judge verdict (prefill or LLM), not
`library_reward OR judge`. `library_reward` is still computed so
the symbolic_accuracy diagnostic remains available.
`judge_evaluations` semantics: populated list for LLM calls, empty list
(not None) for prefill shortcuts. This keeps the existing
`_math_score_fn` logic working — it already gates judge_accuracy on
`judge_evaluations is not None`, and "prefill fired" IS a judge verdict.
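Putting steps 1-5 together, a condensed sketch of the decision order (the regexes and helper names below are illustrative stand-ins, not the server's actual parity-mode code; `search_boxed` refers to the brace-matching sketch earlier in this PR):

```python
import re

# Illustrative think-tag stripping: paired <think(ing)>...</think(ing)> blocks
# are dropped, and a lone closing tag drops everything before it. Stand-in
# for the server's _strip_think_tags, not a copy of it.
_PAIRED = re.compile(r"<think(?:ing)?>.*?</think(?:ing)?>", re.DOTALL)
_DANGLING_CLOSE = re.compile(r"^.*?</think(?:ing)?>", re.DOTALL)

def strip_think_tags(text: str) -> str:
    text = _PAIRED.sub("", text)
    return _DANGLING_CLOSE.sub("", text)

# Condensed skills_parity_mode decision order; call_llm_judge stands in for
# whatever async judge client the server uses.
async def parity_reward(generation: str, expected: str, call_llm_judge) -> float:
    text = strip_think_tags(generation)               # 1. parse_reasoning=True equivalent
    extracted = search_boxed(text)                     # 2. last balanced \boxed{...}
    if not extracted:                                  # 3a. empty extraction -> reward 0, no judge
        return 0.0
    if extracted.strip() == expected.strip():          # 3b. exact match -> reward 1, no judge
        return 1.0
    return await call_llm_judge(extracted, expected)   # 4/5. the judge verdict is the reward
```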
Defaults preserved for every existing consumer. Only
benchmarks/imo_answerbench/config.yaml flips it on.
Tests: 9 new in TestSkillsParityMode + existing 30 tests still pass.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…ly parity
New config flag (default False) that applies Skills' parse_reasoning=True
+ search_boxed extraction to the symbolic-only (no-judge) verification
path — for benchmarks whose Skills config is `eval_type=math` with
`should_use_judge=false` (apex-shortlist, aime25, hmmt_feb25, etc.).
When parse_reasoning_like_skills=True:
1. Strip <think>...</think> (and pre-</think> content) from the raw
model output before extraction.
2. Extract the last balanced \boxed{...} content via _search_boxed
(brace-matcher from nemo_skills.evaluation.math_grader).
3. If extraction returns None, reward=0 — mirrors Skills'
predicted_answer=None -> no_answer.
4. Otherwise run math-verify on the raw \boxed{<extracted>} as before.
Unlike skills_parity_mode, this does NOT route through the LLM judge —
it preserves the symbolic-only reward signal that pure-math benchmarks
expect. library_reward == reward (the symbolic_accuracy diagnostic
stays aligned with the headline metric).
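The symbolic-only counterpart this flag enables, again sketched on top of the strip/extract helpers above (the `math_verify_equal` callable is a placeholder for the math-verify comparison, not a real API):

```python
# Symbolic-only path sketched with the strip_think_tags / search_boxed
# helpers from earlier; math_verify_equal stands in for the math-verify check.
def symbolic_reward(generation: str, expected: str, math_verify_equal) -> float:
    extracted = search_boxed(strip_think_tags(generation))
    if extracted is None:                       # mirrors Skills' predicted_answer=None -> no_answer
        return 0.0
    boxed = r"\boxed{" + extracted + "}"        # re-wrap the raw extraction, as before
    return 1.0 if math_verify_equal(boxed, expected) else 0.0
```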
Also declare the flag in resources_servers/math_with_judge/configs/
math_with_judge.yaml as `false`, so benchmark configs inheriting via
`_inherit_from` can override it without tripping OmegaConf struct-mode
"Key 'parse_reasoning_like_skills' is not in struct" errors.
Tests: 5 new in TestParseReasoningLikeSkills covering paired
<think>...</think>, pre-</think>-only, truncated (no close tag), plain
\boxed without think tags, and wrong \boxed content.
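As an aside on the struct-mode point above: in isolation, the failure the YAML declaration avoids looks roughly like this (a standalone OmegaConf sketch, not Gym's config machinery):

```python
from omegaconf import OmegaConf

base = OmegaConf.create({"should_use_judge": False})   # key absent from the base config
OmegaConf.set_struct(base, True)

# Merging an override that introduces an undeclared key fails in struct mode:
#   Key 'parse_reasoning_like_skills' is not in struct
OmegaConf.merge(base, {"parse_reasoning_like_skills": True})

# Declaring `parse_reasoning_like_skills: false` in the base YAML lets the
# same override merge cleanly.
```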
Discovered on apex-shortlist parity comparison: the original port
produced 37.5% pass@1 vs Skills' 32.3% and 1% no_answer vs 26.6%. Both
gaps trace to math-verify extracting spurious \boxed{} (and even
numeric substrings) from mid-reasoning content that Skills' extractor
would have dropped.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
- `judge_prompt_path`, `judge_equal_label`, `judge_not_equal_label`, and `judge_bidirectional` were pulled into the config schema by the `skills_parity_mode` cherry-pick's context lines, but no code path on this branch reads them — they originate from an imo_answerbench commit (bb6f8c2) that was not cherry-picked here. Shipping them would silently accept overrides (e.g. `judge_prompt_path: ...`) that would have no effect.
- TestSkillsParityMode._build() no longer takes a `bidirectional` kwarg; nothing in the test file passes True to it anyway.
- Drop the `num_repeats here is IGNORED` comment from benchmarks/apex_shortlist/config.yaml — same information lives in the run script (`+num_repeats=4`) and the `migrate-benchmark` skill notes, and the raw config is not a great place to document CLI override semantics.

All 198 math_with_judge tests still pass.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Gym dropped few-shot prompt support project-wide many PRs ago; the
notes about 'no few-shots' / '{examples} empty by default' are stale
context from the Skills source.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
- test_llm_judge_runs_when_no_prefill_regardless_of_symbolic assumed a
unidirectional-judge path that relied on judge_bidirectional=False.
That config field was removed in the previous commit (it was never
read by any code path on this branch). _verify_answer_with_judge
always runs bidirectional, so when the first-order judge says
equal, it runs a second-order cross-check — expect 2 judge calls
and a 2-element judge_evaluations list.
- test_truncated_no_close_think_returns_no_answer encoded a wrong
mental model: `_strip_think_tags` only strips when </think> is
present, so "First step \boxed{99} more work" (no tags at all)
passes through unchanged and _search_boxed legitimately extracts
"99". Replace with test_no_boxed_returns_no_answer, which exercises
the real no-answer path: model output with paired <think>…</think>
but no \boxed{} after → _search_boxed returns None → reward=0.
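Concretely, with the sketch helpers from earlier in this PR, the two cases look like this:

```python
# No think tags at all: nothing is stripped, so the brace-matcher
# legitimately finds the boxed value (hence the old test's "no answer"
# expectation was wrong).
text = r"First step \boxed{99} more work"
assert strip_think_tags(text) == text
assert search_boxed(text) == "99"

# Paired tags but no \boxed afterwards: the real no-answer path.
assert search_boxed(strip_think_tags("<think>try 99</think> The answer is ninety-nine.")) is None
```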
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…instead
The parse_reasoning_like_skills flag was added earlier on this branch to
handle <think>...</think> asymmetries between Skills and Gym on APEX
Shortlist's parity comparison. Empirically it worked (pass@1 delta
shrank from +5.2 pp to +2.1 pp), but it moved the think-tag surgery
into the resource server.
The cleaner fix is to have vLLM strip reasoning at the model-output
layer via `--reasoning-parser deepseek_r1` (for Nemotron) on both
Skills' and Gym's vLLM servers. That eliminates the asymmetry
upstream, so neither side needs post-hoc string surgery. A/B
validated on apex-shortlist: pass@1 delta drops further from +2.1 pp
to -0.5 pp and cross-seed std_dev on the Gym side collapses from
4.96 to 1.70 pp.
Recipe scripts (run_apex_shortlist_{ns,gym}.py) now add
`--reasoning-parser deepseek_r1` to server_args on both sides.
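In the recipe scripts that amounts to something like the following; the list construction is a stand-in, only the flag itself comes from this change:

```python
# Stand-in for how run_apex_shortlist_{ns,gym}.py extend the vLLM server
# arguments; the surrounding plumbing is illustrative.
server_args = [
    "--reasoning-parser", "deepseek_r1",   # vLLM strips <think>...</think> before output leaves the server
]
```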
Removed:
- `parse_reasoning_like_skills: bool = False` from
LibraryJudgeMathResourcesServerConfig
- The code branch reading that flag in _verify_answer
- `parse_reasoning_like_skills: false` declaration in
resources_servers/math_with_judge/configs/math_with_judge.yaml
(it was only there to allow struct-mode overrides via
_inherit_from, which no longer occurs)
- `parse_reasoning_like_skills: true` from
benchmarks/apex_shortlist/config.yaml
- TestParseReasoningLikeSkills class (5 tests)
Kept:
- `_search_boxed` + `_strip_think_tags` + skills_parity_mode flag
(cherry-picked from migrate-gym-imo-answerbench). These remain
useful for LLM-judge benchmarks that can't rely on vLLM's
reasoning parser doing the stripping.
All 198 math_with_judge ng_dev_test tests still pass.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Force-pushed from f98272e to f001230
…g-parser)
The skills_parity_mode path was cherry-picked earlier to let Gym mirror
Skills' end-to-end judge pipeline for per-rollout parity on LLM-judge
benchmarks: strip <think>...</think>, extract via _search_boxed, run
prefill shortcuts, always call the judge bidirectionally.
The first three steps all assume Gym has to do think-tag surgery at the
resource-server layer. With vLLM's --reasoning-parser configured on the
model server (the recommended default per the migrate-benchmark skill),
think-tags are stripped at the model-output layer before they ever
reach the resource server — so the surgery is a no-op. And the
prefill-shortcut + always-bidirectional-judge semantics belong with
the LLM-judge benchmarks that exercise them (tracked on the
migrate-gym-imo-answerbench branch), not here.
Removed:
- `skills_parity_mode: bool = False` config field
- `if self.config.skills_parity_mode: ...` dispatch in _verify_answer
- `_strip_think_tags` classmethod + the four `_THINK_*_RE` regex
ClassVars
- `_verify_answer_skills_parity` coroutine
- `import re` (no longer used anywhere in this file)
- `TestSkillsParityMode` class
Kept:
- `_search_boxed` + its use in _verify_answer for judge-input
preference over math-verify's normalized form. That fix is about
mangled judge input on non-numeric \boxed{...} content and is
independent of think-tag handling.
202 math_with_judge tests still pass via ng_dev_test.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
cmunley1
reviewed
Apr 24, 2026
# to math-verify's extraction preserves the prior behavior when
# no balanced \boxed{} is present; ultimately falls back to the
# full generation if neither is available.
raw_boxed = self._search_boxed(generated_answer)
Contributor
do you know if this impacts other environments using this verifier?
Contributor
Author
I just now removed the judge changes after validating that I could fully resolve the Skills/Gym differences by using a reasoning parser. This was just an issue when \boxed inside of reasoning traces was triggering the grader.
…action as judge input"

This reverts commit 8b98293.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Force-pushed from 4dbeea0 to 11ebb11
cmunley1
approved these changes
Apr 25, 2026
Add APEX Shortlist benchmark
Migrates the `apex-shortlist` benchmark from NeMo Skills into Gym on top of the existing `math_with_judge` resource server. Verification uses the server's symbolic-only path (math-verify, `should_use_judge: false`) with a new opt-in `parse_reasoning_like_skills` flag that mirrors Skills' `parse_reasoning=True` + brace-matched `\boxed{…}` extraction — needed to avoid spurious mid-reasoning extractions on truncated generations.

Includes

- `benchmarks/apex_shortlist/` — benchmark config, prepare.py, prompt template
  - `MathArena/apex-shortlist` on HuggingFace (48 problems, 32 integer + 16 symbolic answers)
- `resources_servers/math_with_judge/` — extended (not new)
  - `_search_boxed` brace-matching extractor (mirrors `nemo_skills.evaluation.math_grader.search_boxed`) and prefers the raw `\boxed{…}` LaTeX over math-verify's normalized form as judge input
  - `_strip_think_tags` + `skills_parity_mode` flag that routes rollouts through Skills' full judge pipeline (parse_reasoning → search_boxed → prefill shortcuts → LLM judge) for per-rollout parity
  - `parse_reasoning_like_skills` flag (new on this branch): applies the same extraction to the symbolic-only path (no judge), for benchmarks whose Skills config is `eval_type=math` + `should_use_judge=false`

Validated against NeMo Skills
Single comparison run on draco-oci: 48 problems × 4 rollouts/task, Nemotron-3-Nano-30B-A3B-BF16, T=1.0, top_p=0.95, max_output_tokens=65536. Skills uses 4× single-node vLLM (one per seed); Gym uses a single 4-node DP vLLM (TP=8, DP=4, Ray-coordinated) with `+num_repeats=4`.