Skip to content

Add APEX Shortlist benchmark#1105

Merged
cmunley1 merged 14 commits intoNVIDIA-NeMo:mainfrom
gwarmstrong:georgea/migrate-gym-apex-shortlist
Apr 25, 2026
Merged

Add APEX Shortlist benchmark#1105
cmunley1 merged 14 commits intoNVIDIA-NeMo:mainfrom
gwarmstrong:georgea/migrate-gym-apex-shortlist

Conversation

@gwarmstrong
Copy link
Copy Markdown
Contributor

Add APEX Shortlist benchmark

Migrates the apex-shortlist benchmark from NeMo Skills into Gym on top of the existing math_with_judge resource server. Verification uses the server's symbolic-only path (math-verify, should_use_judge: false) with a new opt-in parse_reasoning_like_skills flag that mirrors Skills' parse_reasoning=True + brace-matched \boxed{…} extraction — needed to avoid spurious mid-reasoning extractions on truncated generations.

Includes

  • benchmarks/apex_shortlist/ — benchmark config, prepare.py, prompt template
    • Data source: MathArena/apex-shortlist on HuggingFace (48 problems, 32 integer + 16 symbolic answers)
  • resources_servers/math_with_judge/ — extended (not new)
    • _search_boxed brace-matching extractor (mirrors nemo_skills.evaluation.math_grader.search_boxed) and prefers the raw \boxed{…} LaTeX over math-verify's normalized form as judge input
    • _strip_think_tags + skills_parity_mode flag that routes rollouts through Skills' full judge pipeline (parse_reasoningsearch_boxed → prefill shortcuts → LLM judge) for per-rollout parity
    • parse_reasoning_like_skills flag (new on this branch): applies the same extraction to the symbolic-only path (no judge), for benchmarks whose Skills config is eval_type=math + should_use_judge=false

Validated against NeMo Skills

Single comparison run on draco-oci: 48 problems × 4 rollouts/task, Nemotron-3-Nano-30B-A3B-BF16, T=1.0 top_p=0.95 max_output_tokens=65536. Skills uses 4× single-node vLLM (one per seed); Gym uses a single 4-node DP vLLM (TP=8, DP=4, Ray-coordinated) with +num_repeats=4.

===========================================================================
eval_type=math (symbolic-only, math-verify) | 4 rollouts/task | T=1.0 top_p=0.95
Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
===========================================================================
Metric                        Skills         Gym         Delta
---------------------------------------------------------------
pass@1[avg-of-4]               32.3%         34.4%        +2.1%
majority@4                     40.5%         41.7%        +1.2%
pass@4                         56.3%         54.2%        -2.1%
no_answer@1[avg-of-4]          26.6%         29.7%        +3.1%

Migrates NeMo Skills' `apex-shortlist` benchmark into Gym. Reuses the
`math_with_judge` resource server in symbolic-only mode
(`should_use_judge: false`) to mirror Skills' `eval_type=math` default.
Prompt is a character-for-character copy of Skills' `generic/math.yaml`.

Data source: MathArena/apex-shortlist on HuggingFace (48 problems,
mix of integer and symbolic answers).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…s judge input

math-verify is tuned for numeric/algebraic answers; it silently mangles
non-numeric \boxed content (function definitions, sets, conditions) into
a degenerate fragment during extraction. For example:

  model wrote:  \boxed{g(x)=2x^{3}+C \text{ or } g(x)=-2x^{3}+C, C\in\mathbb{R}}
  math-verify extracted:  "2"

When the symbolic-first cascade falls through to the LLM judge on such
cases, the judge saw "2" and (correctly) rejected it, even though the
model's actual answer was the full function definition.

The fix: when sending the answer to the judge, prefer the raw LaTeX
inside the model's last balanced \boxed{...} (or \fbox{...}) over
math-verify's normalized form. Fall back to math-verify's extraction
when no balanced \boxed is present, then to the full generation — the
prior behavior — if neither is available.

Implementation is a small, local change to _verify_answer and a new
brace-matching _search_boxed helper that mirrors
nemo_skills.evaluation.math_grader.search_boxed so Gym can pass the
same LaTeX a Skills baseline would pass to its judge.

Does NOT change:
  - extracted_answer field (still math-verify's output, used for
    majority-vote grouping and the symbolic_accuracy diagnostic)
  - library_reward / symbolic_accuracy semantics
  - behavior when symbolic passes (judge remains skipped as before)

Discovered on imo-answerbench parity comparison: 107/1600 rollouts
fell into this bucket and were flipped by the judge on Gym but judged
correctly on Skills, driving a 5pp pass@1 gap that was not a migration
bug.

Tests: 9 new (TestSearchBoxed, TestRawBoxedJudgeInput) + 21 existing
still pass.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
New config flag (default False) that mirrors NeMo Skills' judge pipeline
end-to-end, so Gym's per-rollout reward matches Skills' `judge_correct`
verdict 1:1 — not just the aggregate average.

When skills_parity_mode=True:
  1. Strip <think>...</think> / <thinking>...</thinking> from the raw
     model output (mirrors Skills' parse_reasoning=True). Also strips
     pre-</think> content when only a closing tag survives, which is
     common when the reasoning parser eats the opening tag.
  2. Extract the last balanced \boxed{...} content via the
     _search_boxed brace-matcher (not math-verify). The raw LaTeX the
     model actually wrote becomes `extracted_answer`, matching Skills'
     `predicted_answer` field used for majority@k grouping.
  3. Prefill shortcuts, mirroring nemo_skills.utils.prefill_judgement:
       empty extraction             -> reward 0, no LLM call
       extracted.strip() == expected.strip() -> reward 1, no LLM call
  4. Otherwise call the LLM judge unconditionally. Symbolic success does
     NOT short-circuit the judge in this mode — Skills' metric doesn't
     either.
  5. `reward` == final judge verdict (prefill or LLM), not
     `library_reward OR judge`. `library_reward` is still computed so
     the symbolic_accuracy diagnostic remains available.

`judge_evaluations` semantics: populated list for LLM calls, empty list
(not None) for prefill shortcuts. This keeps the existing
`_math_score_fn` logic working — it already gates judge_accuracy on
`judge_evaluations is not None`, and "prefill fired" IS a judge verdict.

Defaults preserved for every existing consumer. Only
benchmarks/imo_answerbench/config.yaml flips it on.

Tests: 9 new in TestSkillsParityMode + existing 30 tests still pass.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…ly parity

New config flag (default False) that applies Skills' parse_reasoning=True
+ search_boxed extraction to the symbolic-only (no-judge) verification
path — for benchmarks whose Skills config is `eval_type=math` with
`should_use_judge=false` (apex-shortlist, aime25, hmmt_feb25, etc.).

When parse_reasoning_like_skills=True:
  1. Strip <think>...</think> (and pre-</think> content) from the raw
     model output before extraction.
  2. Extract the last balanced \\boxed{...} content via _search_boxed
     (brace-matcher from nemo_skills.evaluation.math_grader).
  3. If extraction returns None, reward=0 — mirrors Skills'
     predicted_answer=None -> no_answer.
  4. Otherwise run math-verify on the raw \\boxed{<extracted>} as before.

Unlike skills_parity_mode, this does NOT route through the LLM judge —
it preserves the symbolic-only reward signal that pure-math benchmarks
expect. library_reward == reward (the symbolic_accuracy diagnostic
stays aligned with the headline metric).

Also declare the flag in resources_servers/math_with_judge/configs/
math_with_judge.yaml as `false`, so benchmark configs inheriting via
`_inherit_from` can override it without tripping OmegaConf struct-mode
"Key 'parse_reasoning_like_skills' is not in struct" errors.

Tests: 5 new in TestParseReasoningLikeSkills covering paired
<think>...</think>, pre-</think>-only, truncated (no close tag), plain
\\boxed without think tags, and wrong \\boxed content.

Discovered on apex-shortlist parity comparison: the original port
produced 37.5% pass@1 vs Skills' 32.3% and 1% no_answer vs 26.6%. Both
gaps trace to math-verify extracting spurious \\boxed{} (and even
numeric substrings) from mid-reasoning content that Skills' extractor
would have dropped.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
- `judge_prompt_path`, `judge_equal_label`, `judge_not_equal_label`, and
  `judge_bidirectional` were pulled into the config schema by the
  `skills_parity_mode` cherry-pick's context lines, but no code path on
  this branch reads them — they originate from an imo_answerbench
  commit (bb6f8c2) that was not cherry-picked here. Shipping them
  would silently accept overrides (e.g. `judge_prompt_path: ...`) that
  would have no effect.
- TestSkillsParityMode._build() no longer takes a `bidirectional`
  kwarg; nothing in the test file passes True to it anyway.
- Drop the `num_repeats here is IGNORED` comment from
  benchmarks/apex_shortlist/config.yaml — same information lives in
  the run script (`+num_repeats=4`) and the `migrate-benchmark` skill
  notes, and the raw config is not a great place to document CLI
  override semantics.

All 198 math_with_judge tests still pass.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Gym dropped few-shot prompt support project-wide many PRs ago; the
notes about 'no few-shots' / '{examples} empty by default' are stale
context from the Skills source.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

gwarmstrong and others added 4 commits April 21, 2026 14:56
- test_llm_judge_runs_when_no_prefill_regardless_of_symbolic assumed a
  unidirectional-judge path that relied on judge_bidirectional=False.
  That config field was removed in the previous commit (it was never
  read by any code path on this branch). _verify_answer_with_judge
  always runs bidirectional, so when the first-order judge says
  equal, it runs a second-order cross-check — expect 2 judge calls
  and a 2-element judge_evaluations list.

- test_truncated_no_close_think_returns_no_answer encoded a wrong
  mental model: `_strip_think_tags` only strips when </think> is
  present, so "First step \boxed{99} more work" (no tags at all)
  passes through unchanged and _search_boxed legitimately extracts
  "99". Replace with test_no_boxed_returns_no_answer, which exercises
  the real no-answer path: model output with paired <think>…</think>
  but no \boxed{} after → _search_boxed returns None → reward=0.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…instead

The parse_reasoning_like_skills flag was added earlier on this branch to
handle <think>...</think> asymmetries between Skills and Gym on APEX
Shortlist's parity comparison. Empirically it worked (pass@1 delta
shrunk from +5.2 pp to +2.1 pp), but it moved the think-tag surgery
into the resource server.

The cleaner fix is to have vLLM strip reasoning at the model-output
layer via `--reasoning-parser deepseek_r1` (for Nemotron) on both
Skills' and Gym's vLLM servers. That eliminates the asymmetry
upstream, so neither side needs post-hoc string surgery. A/B
validated on apex-shortlist: pass@1 delta drops further from +2.1 pp
to -0.5 pp and cross-seed std_dev on the Gym side collapses from
4.96 to 1.70 pp.

Recipe scripts (run_apex_shortlist_{ns,gym}.py) now add
`--reasoning-parser deepseek_r1` to server_args on both sides.

Removed:
  - `parse_reasoning_like_skills: bool = False` from
    LibraryJudgeMathResourcesServerConfig
  - The code branch reading that flag in _verify_answer
  - `parse_reasoning_like_skills: false` declaration in
    resources_servers/math_with_judge/configs/math_with_judge.yaml
    (it was only there to allow struct-mode overrides via
    _inherit_from, which no longer occurs)
  - `parse_reasoning_like_skills: true` from
    benchmarks/apex_shortlist/config.yaml
  - TestParseReasoningLikeSkills class (5 tests)

Kept:
  - `_search_boxed` + `_strip_think_tags` + skills_parity_mode flag
    (cherry-picked from migrate-gym-imo-answerbench). These remain
    useful for LLM-judge benchmarks that can't rely on vLLM's
    reasoning parser doing the stripping.

All 198 math_with_judge ng_dev_test tests still pass.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Comment thread benchmarks/apex_shortlist/config.yaml Outdated
Comment thread benchmarks/apex_shortlist/config.yaml Outdated
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
@gwarmstrong gwarmstrong force-pushed the georgea/migrate-gym-apex-shortlist branch from f98272e to f001230 Compare April 22, 2026 22:15
…g-parser)

The skills_parity_mode path was cherry-picked earlier to let Gym mirror
Skills' end-to-end judge pipeline for per-rollout parity on LLM-judge
benchmarks: strip <think>...</think>, extract via _search_boxed, run
prefill shortcuts, always call the judge bidirectionally.

The first three steps all assume Gym has to do think-tag surgery at the
resource-server layer. With vLLM's --reasoning-parser configured on the
model server (the recommended default per the migrate-benchmark skill),
think-tags are stripped at the model-output layer before they ever
reach the resource server — so the surgery is a no-op. And the
prefill-shortcut + always-bidirectional-judge semantics belong with
the LLM-judge benchmarks that exercise them (tracked on the
migrate-gym-imo-answerbench branch), not here.

Removed:
  - `skills_parity_mode: bool = False` config field
  - `if self.config.skills_parity_mode: ...` dispatch in _verify_answer
  - `_strip_think_tags` classmethod + the four `_THINK_*_RE` regex
    ClassVars
  - `_verify_answer_skills_parity` coroutine
  - `import re` (no longer used anywhere in this file)
  - `TestSkillsParityMode` class

Kept:
  - `_search_boxed` + its use in _verify_answer for judge-input
    preference over math-verify's normalized form. That fix is about
    mangled judge input on non-numeric \boxed{...} content and is
    independent of think-tag handling.

202 math_with_judge tests still pass via ng_dev_test.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
# to math-verify's extraction preserves the prior behavior when
# no balanced \boxed{} is present; ultimately falls back to the
# full generation if neither is available.
raw_boxed = self._search_boxed(generated_answer)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know if this impacts other environments using this verifier?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just now removed the judge changes after validating that I could fully resolve the Skills/Gym differences by using a reasoning parser. This was just an issue when \boxed inside of reasoning traces was triggering the grader.

…action as judge input"

This reverts commit 8b98293.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
@gwarmstrong gwarmstrong force-pushed the georgea/migrate-gym-apex-shortlist branch from 4dbeea0 to 11ebb11 Compare April 24, 2026 21:08
@gwarmstrong gwarmstrong requested a review from cmunley1 April 24, 2026 21:10
@cmunley1 cmunley1 merged commit 2b71315 into NVIDIA-NeMo:main Apr 25, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants