Add APEX Shortlist benchmark#1105
Merged
cmunley1 merged 14 commits into NVIDIA-NeMo:main on Apr 25, 2026
Conversation
Migrates NeMo Skills' `apex-shortlist` benchmark into Gym. Reuses the `math_with_judge` resource server in symbolic-only mode (`should_use_judge: false`) to mirror Skills' `eval_type=math` default. The prompt is a character-for-character copy of Skills' `generic/math.yaml`. Data source: `MathArena/apex-shortlist` on HuggingFace (48 problems, a mix of integer and symbolic answers).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
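For context, the dataset can be pulled straight from the Hub. A minimal sketch (the split name and printed fields are assumptions, not necessarily what prepare.py does):

```python
# Illustrative only: fetch the APEX Shortlist problems from HuggingFace.
# The "train" split and the printed fields are assumptions, not Gym's
# actual prepare.py logic.
from datasets import load_dataset

ds = load_dataset("MathArena/apex-shortlist", split="train")
print(len(ds))   # expected to be 48 problems
print(ds[0])     # one problem record with its statement and reference answer
```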
…s judge input
math-verify is tuned for numeric/algebraic answers; it silently mangles
non-numeric \boxed content (function definitions, sets, conditions) into
a degenerate fragment during extraction. For example:
model wrote: \boxed{g(x)=2x^{3}+C \text{ or } g(x)=-2x^{3}+C, C\in\mathbb{R}}
math-verify extracted: "2"
When the symbolic-first cascade falls through to the LLM judge on such
cases, the judge saw "2" and (correctly) rejected it, even though the
model's actual answer was the full function definition.
The fix: when sending the answer to the judge, prefer the raw LaTeX
inside the model's last balanced \boxed{...} (or \fbox{...}) over
math-verify's normalized form. Fall back to math-verify's extraction
when no balanced \boxed is present, then to the full generation — the
prior behavior — if neither is available.
Implementation is a small, local change to _verify_answer and a new
brace-matching _search_boxed helper that mirrors
nemo_skills.evaluation.math_grader.search_boxed so Gym can pass the
same LaTeX a Skills baseline would pass to its judge.
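For illustration, a minimal sketch of the brace-matching extraction and the judge-input preference described above (not the actual `_search_boxed` / `_verify_answer` code; `\fbox` handling and other edge cases are omitted):

```python
# Sketch of a last-balanced-\boxed{...} extractor plus the fallback chain
# used to pick the judge input. Hypothetical stand-ins for _search_boxed
# and the _verify_answer change described in this commit.
def search_boxed(text: str) -> str | None:
    marker = r"\boxed{"
    start = text.rfind(marker)            # last \boxed{...} wins
    if start == -1:
        return None
    i = start + len(marker)
    begin, depth = i, 1
    while i < len(text) and depth:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    return None if depth else text[begin : i - 1]   # raw LaTeX inside the braces


def pick_judge_input(generation: str, math_verify_extraction: str | None) -> str:
    # Prefer the model's own raw \boxed content; fall back to math-verify's
    # normalized extraction, then to the full generation (prior behavior).
    return search_boxed(generation) or math_verify_extraction or generation


full = r"... so \boxed{g(x)=2x^{3}+C \text{ or } g(x)=-2x^{3}+C, C\in\mathbb{R}}"
assert search_boxed(full).startswith("g(x)=2x^{3}+C")
```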
Does NOT change:
- extracted_answer field (still math-verify's output, used for
majority-vote grouping and the symbolic_accuracy diagnostic)
- library_reward / symbolic_accuracy semantics
- behavior when symbolic passes (judge remains skipped as before)
Discovered on imo-answerbench parity comparison: 107/1600 rollouts
fell into this bucket and were flipped by the judge on Gym but judged
correctly on Skills, driving a 5pp pass@1 gap that was not a migration
bug.
Tests: 9 new (TestSearchBoxed, TestRawBoxedJudgeInput) + 21 existing
still pass.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
New config flag (default False) that mirrors NeMo Skills' judge pipeline
end-to-end, so Gym's per-rollout reward matches Skills' `judge_correct`
verdict 1:1 — not just the aggregate average.
When skills_parity_mode=True:
1. Strip <think>...</think> / <thinking>...</thinking> from the raw
model output (mirrors Skills' parse_reasoning=True). Also strips
pre-</think> content when only a closing tag survives, which is
common when the reasoning parser eats the opening tag.
2. Extract the last balanced \boxed{...} content via the
_search_boxed brace-matcher (not math-verify). The raw LaTeX the
model actually wrote becomes `extracted_answer`, matching Skills'
`predicted_answer` field used for majority@k grouping.
3. Prefill shortcuts, mirroring nemo_skills.utils.prefill_judgement:
empty extraction -> reward 0, no LLM call
extracted.strip() == expected.strip() -> reward 1, no LLM call
4. Otherwise call the LLM judge unconditionally. Symbolic success does
NOT short-circuit the judge in this mode — Skills' metric doesn't
either.
5. `reward` == final judge verdict (prefill or LLM), not
`library_reward OR judge`. `library_reward` is still computed so
the symbolic_accuracy diagnostic remains available.
`judge_evaluations` semantics: populated list for LLM calls, empty list
(not None) for prefill shortcuts. This keeps the existing
`_math_score_fn` logic working — it already gates judge_accuracy on
`judge_evaluations is not None`, and "prefill fired" IS a judge verdict.
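Putting steps 1-5 together, a condensed sketch of the decision order (the regexes and helper names below are illustrative stand-ins, not the server's actual parity-mode code; `search_boxed` refers to the brace-matching sketch earlier in this PR):

```python
import re

# Illustrative think-tag stripping: paired <think(ing)>...</think(ing)> blocks
# are dropped, and a lone closing tag drops everything before it. Stand-in
# for the server's _strip_think_tags, not a copy of it.
_PAIRED = re.compile(r"<think(?:ing)?>.*?</think(?:ing)?>", re.DOTALL)
_DANGLING_CLOSE = re.compile(r"^.*?</think(?:ing)?>", re.DOTALL)

def strip_think_tags(text: str) -> str:
    text = _PAIRED.sub("", text)
    return _DANGLING_CLOSE.sub("", text)

# Condensed skills_parity_mode decision order; call_llm_judge stands in for
# whatever async judge client the server uses.
async def parity_reward(generation: str, expected: str, call_llm_judge) -> float:
    text = strip_think_tags(generation)               # 1. parse_reasoning=True equivalent
    extracted = search_boxed(text)                     # 2. last balanced \boxed{...}
    if not extracted:                                  # 3a. empty extraction -> reward 0, no judge
        return 0.0
    if extracted.strip() == expected.strip():          # 3b. exact match -> reward 1, no judge
        return 1.0
    return await call_llm_judge(extracted, expected)   # 4/5. the judge verdict is the reward
```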
Defaults preserved for every existing consumer. Only
benchmarks/imo_answerbench/config.yaml flips it on.
Tests: 9 new in TestSkillsParityMode + existing 30 tests still pass.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…ly parity
New config flag (default False) that applies Skills' parse_reasoning=True
+ search_boxed extraction to the symbolic-only (no-judge) verification
path — for benchmarks whose Skills config is `eval_type=math` with
`should_use_judge=false` (apex-shortlist, aime25, hmmt_feb25, etc.).
When parse_reasoning_like_skills=True:
1. Strip <think>...</think> (and pre-</think> content) from the raw
model output before extraction.
2. Extract the last balanced \boxed{...} content via _search_boxed
(brace-matcher from nemo_skills.evaluation.math_grader).
3. If extraction returns None, reward=0 — mirrors Skills'
predicted_answer=None -> no_answer.
4. Otherwise run math-verify on the raw \boxed{<extracted>} as before.
Unlike skills_parity_mode, this does NOT route through the LLM judge —
it preserves the symbolic-only reward signal that pure-math benchmarks
expect. library_reward == reward (the symbolic_accuracy diagnostic
stays aligned with the headline metric).
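The symbolic-only counterpart this flag enables, again sketched on top of the strip/extract helpers above (the `math_verify_equal` callable is a placeholder for the math-verify comparison, not a real API):

```python
# Symbolic-only path sketched with the strip_think_tags / search_boxed
# helpers from earlier; math_verify_equal stands in for the math-verify check.
def symbolic_reward(generation: str, expected: str, math_verify_equal) -> float:
    extracted = search_boxed(strip_think_tags(generation))
    if extracted is None:                       # mirrors Skills' predicted_answer=None -> no_answer
        return 0.0
    boxed = r"\boxed{" + extracted + "}"        # re-wrap the raw extraction, as before
    return 1.0 if math_verify_equal(boxed, expected) else 0.0
```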
Also declare the flag in resources_servers/math_with_judge/configs/
math_with_judge.yaml as `false`, so benchmark configs inheriting via
`_inherit_from` can override it without tripping OmegaConf struct-mode
"Key 'parse_reasoning_like_skills' is not in struct" errors.
Tests: 5 new in TestParseReasoningLikeSkills covering paired
<think>...</think>, pre-</think>-only, truncated (no close tag), plain
\boxed without think tags, and wrong \boxed content.
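As an aside on the struct-mode point above: in isolation, the failure the YAML declaration avoids looks roughly like this (a standalone OmegaConf sketch, not Gym's config machinery):

```python
from omegaconf import OmegaConf

base = OmegaConf.create({"should_use_judge": False})   # key absent from the base config
OmegaConf.set_struct(base, True)

# Merging an override that introduces an undeclared key fails in struct mode:
#   Key 'parse_reasoning_like_skills' is not in struct
OmegaConf.merge(base, {"parse_reasoning_like_skills": True})

# Declaring `parse_reasoning_like_skills: false` in the base YAML lets the
# same override merge cleanly.
```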
Discovered on apex-shortlist parity comparison: the original port
produced 37.5% pass@1 vs Skills' 32.3% and 1% no_answer vs 26.6%. Both
gaps trace to math-verify extracting spurious \boxed{} (and even
numeric substrings) from mid-reasoning content that Skills' extractor
would have dropped.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
- `judge_prompt_path`, `judge_equal_label`, `judge_not_equal_label`, and `judge_bidirectional` were pulled into the config schema by the `skills_parity_mode` cherry-pick's context lines, but no code path on this branch reads them — they originate from an imo_answerbench commit (bb6f8c2) that was not cherry-picked here. Shipping them would silently accept overrides (e.g. `judge_prompt_path: ...`) that would have no effect.
- TestSkillsParityMode._build() no longer takes a `bidirectional` kwarg; nothing in the test file passes True to it anyway.
- Drop the `num_repeats here is IGNORED` comment from benchmarks/apex_shortlist/config.yaml — same information lives in the run script (`+num_repeats=4`) and the `migrate-benchmark` skill notes, and the raw config is not a great place to document CLI override semantics.

All 198 math_with_judge tests still pass.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Gym dropped few-shot prompt support project-wide many PRs ago; the
notes about 'no few-shots' / '{examples} empty by default' are stale
context from the Skills source.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
- test_llm_judge_runs_when_no_prefill_regardless_of_symbolic assumed a
unidirectional-judge path that relied on judge_bidirectional=False.
That config field was removed in the previous commit (it was never
read by any code path on this branch). _verify_answer_with_judge
always runs bidirectional, so when the first-order judge says
equal, it runs a second-order cross-check — expect 2 judge calls
and a 2-element judge_evaluations list.
- test_truncated_no_close_think_returns_no_answer encoded a wrong
mental model: `_strip_think_tags` only strips when </think> is
present, so "First step \boxed{99} more work" (no tags at all)
passes through unchanged and _search_boxed legitimately extracts
"99". Replace with test_no_boxed_returns_no_answer, which exercises
the real no-answer path: model output with paired <think>…</think>
but no \boxed{} after → _search_boxed returns None → reward=0.
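Concretely, with the sketch helpers from earlier in this PR, the two cases look like this:

```python
# No think tags at all: nothing is stripped, so the brace-matcher
# legitimately finds the boxed value (hence the old test's "no answer"
# expectation was wrong).
text = r"First step \boxed{99} more work"
assert strip_think_tags(text) == text
assert search_boxed(text) == "99"

# Paired tags but no \boxed afterwards: the real no-answer path.
assert search_boxed(strip_think_tags("<think>try 99</think> The answer is ninety-nine.")) is None
```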
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…instead
The parse_reasoning_like_skills flag was added earlier on this branch to
handle <think>...</think> asymmetries between Skills and Gym on APEX
Shortlist's parity comparison. Empirically it worked (pass@1 delta
shrank from +5.2 pp to +2.1 pp), but it moved the think-tag surgery
into the resource server.
The cleaner fix is to have vLLM strip reasoning at the model-output
layer via `--reasoning-parser deepseek_r1` (for Nemotron) on both
Skills' and Gym's vLLM servers. That eliminates the asymmetry
upstream, so neither side needs post-hoc string surgery. A/B
validated on apex-shortlist: pass@1 delta drops further from +2.1 pp
to -0.5 pp and cross-seed std_dev on the Gym side collapses from
4.96 to 1.70 pp.
Recipe scripts (run_apex_shortlist_{ns,gym}.py) now add
`--reasoning-parser deepseek_r1` to server_args on both sides.
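In the recipe scripts that amounts to something like the following; the list construction is a stand-in, only the flag itself comes from this change:

```python
# Stand-in for how run_apex_shortlist_{ns,gym}.py extend the vLLM server
# arguments; the surrounding plumbing is illustrative.
server_args = [
    "--reasoning-parser", "deepseek_r1",   # vLLM strips <think>...</think> before output leaves the server
]
```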
Removed:
- `parse_reasoning_like_skills: bool = False` from
LibraryJudgeMathResourcesServerConfig
- The code branch reading that flag in _verify_answer
- `parse_reasoning_like_skills: false` declaration in
resources_servers/math_with_judge/configs/math_with_judge.yaml
(it was only there to allow struct-mode overrides via
_inherit_from, which no longer occurs)
- `parse_reasoning_like_skills: true` from
benchmarks/apex_shortlist/config.yaml
- TestParseReasoningLikeSkills class (5 tests)
Kept:
- `_search_boxed` + `_strip_think_tags` + skills_parity_mode flag
(cherry-picked from migrate-gym-imo-answerbench). These remain
useful for LLM-judge benchmarks that can't rely on vLLM's
reasoning parser doing the stripping.
All 198 math_with_judge ng_dev_test tests still pass.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Force-pushed from f98272e to f001230
…g-parser)
The skills_parity_mode path was cherry-picked earlier to let Gym mirror
Skills' end-to-end judge pipeline for per-rollout parity on LLM-judge
benchmarks: strip <think>...</think>, extract via _search_boxed, run
prefill shortcuts, always call the judge bidirectionally.
The first three steps all assume Gym has to do think-tag surgery at the
resource-server layer. With vLLM's --reasoning-parser configured on the
model server (the recommended default per the migrate-benchmark skill),
think-tags are stripped at the model-output layer before they ever
reach the resource server — so the surgery is a no-op. And the
prefill-shortcut + always-bidirectional-judge semantics belong with
the LLM-judge benchmarks that exercise them (tracked on the
migrate-gym-imo-answerbench branch), not here.
Removed:
- `skills_parity_mode: bool = False` config field
- `if self.config.skills_parity_mode: ...` dispatch in _verify_answer
- `_strip_think_tags` classmethod + the four `_THINK_*_RE` regex
ClassVars
- `_verify_answer_skills_parity` coroutine
- `import re` (no longer used anywhere in this file)
- `TestSkillsParityMode` class
Kept:
- `_search_boxed` + its use in _verify_answer for judge-input
preference over math-verify's normalized form. That fix is about
mangled judge input on non-numeric \boxed{...} content and is
independent of think-tag handling.
202 math_with_judge tests still pass via ng_dev_test.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
cmunley1
reviewed
Apr 24, 2026
# to math-verify's extraction preserves the prior behavior when
# no balanced \boxed{} is present; ultimately falls back to the
# full generation if neither is available.
raw_boxed = self._search_boxed(generated_answer)
Contributor
do you know if this impacts other environments using this verifier?
Contributor
Author
I just now removed the judge changes after validating that I could fully resolve the Skills/Gym differences by using a reasoning parser. This was just an issue when \boxed inside of reasoning traces was triggering the grader.
…action as judge input"

This reverts commit 8b98293.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Force-pushed from 4dbeea0 to 11ebb11
cmunley1
approved these changes
Apr 25, 2026
Add APEX Shortlist benchmark
Migrates the `apex-shortlist` benchmark from NeMo Skills into Gym on top of the existing `math_with_judge` resource server. Verification uses the server's symbolic-only path (math-verify, `should_use_judge: false`) with a new opt-in `parse_reasoning_like_skills` flag that mirrors Skills' `parse_reasoning=True` + brace-matched `\boxed{…}` extraction — needed to avoid spurious mid-reasoning extractions on truncated generations.

Includes

- `benchmarks/apex_shortlist/` — benchmark config, prepare.py, prompt template
  - `MathArena/apex-shortlist` on HuggingFace (48 problems, 32 integer + 16 symbolic answers)
- `resources_servers/math_with_judge/` — extended (not new)
  - `_search_boxed` brace-matching extractor (mirrors `nemo_skills.evaluation.math_grader.search_boxed`) and prefers the raw `\boxed{…}` LaTeX over math-verify's normalized form as judge input
  - `_strip_think_tags` + `skills_parity_mode` flag that routes rollouts through Skills' full judge pipeline (parse_reasoning → search_boxed → prefill shortcuts → LLM judge) for per-rollout parity
  - `parse_reasoning_like_skills` flag (new on this branch): applies the same extraction to the symbolic-only path (no judge), for benchmarks whose Skills config is `eval_type=math` + `should_use_judge=false`

Validated against NeMo Skills
Single comparison run on draco-oci: 48 problems × 4 rollouts/task, Nemotron-3-Nano-30B-A3B-BF16, T=1.0, top_p=0.95, max_output_tokens=65536. Skills uses 4× single-node vLLM (one per seed); Gym uses a single 4-node DP vLLM (TP=8, DP=4, Ray-coordinated) with `+num_repeats=4`.