feat(eval): --backend=gym pilot for ns eval (gsm8k) by gwarmstrong · Pull Request #1454 · NVIDIA-NeMo/Skills

gwarmstrong · 2026-05-18T19:37:20Z

Summary

Scaffolds a Gym generation backend for ns eval, gated on an opt-in --backend={skills,gym} flag (default: skills, no behavior change). The gym path runs ng_run + ng_collect_rollouts against NeMo-Gym, converts the resulting rollouts into a Skills-shape output.jsonl, and lets the existing summarize_results step produce metrics.json unchanged.

Builds on PR #1222 (EvalGenerationUnit / EvalClientScript). Reuses ServerScript / SandboxScript / hostname-ref resolution and NemoGymRolloutsScript's install-and-wait shell scaffolding.

Pilot scope is gsm8k only — other benchmarks fall through to the existing Skills path. A tiny per-benchmark registry (nemo_skills/pipeline/utils/gym/registry.py) maps a benchmark name to (config_paths, agent_name); once Gym's auto-resolver PR lands, this is the obvious replacement point.

What's in here

Three reviewable commits:

feat(eval): add --backend={skills,gym} flag and Skills→Gym override translator
- EvalBackend enum + --backend typer flag on ns eval.
- nemo_skills/pipeline/utils/gym/translator.py: pure function that rewrites Skills ++ Hydra overrides into Gym + overrides for ng_collect_rollouts. Handles direct renames, extra_body merging (with dict-literal syntax to dodge the extra_body: null Hydra-override gotcha), and an explicit drop-list for fields the migration agreed not to carry over.
feat(eval): wire --backend=gym to a GymEvalClientScript pilot (gsm8k only)
- nemo_skills/pipeline/utils/scripts/gym_eval.py: GymEvalClientScript, peer of EvalClientScript. Builds a single shell command: resolve Gym path → activate venv → wait for policy server → start ng_run → poll ng_status → run one ng_collect_rollouts per unit → clean up.
- nemo_skills/pipeline/utils/gym/registry.py: temporary benchmark registry (gsm8k only).
- Dispatcher in pipeline/eval.py: registered-benchmark pre-check, one-benchmark-per-job guard, multi-model not supported.
feat(eval): gym→skills metrics adapter (math case)
- nemo_skills/pipeline/adapters/gym_to_skills.py: pure converter + file-level wrapper + argparse CLI. Maps math_with_judge rollouts to {generation, predicted_answer, expected_answer, symbolic_correct} (plus an opt-in judgement placeholder for later judge benchmarks).
- Per-unit shell step appends the converter so summarize_results produces metrics.json with no changes.

Test plan

Translator: 43 unit tests (renames, extra_body merging, silent-drops, server/sandbox drops, rejected overrides, unknown-passthrough vs strict mode, realistic gsm8k combo).
GymEvalClientScript: 15 unit tests (registry, validation, shell-emit shape, per-seed / per-chunk output paths, translator pipeline, server hostname ref, sandbox env, multi-unit single-ng_run sharing, converter wiring).
Adapter: 15 unit tests (mapping, fallback when library_reward absent, multi-chunk output-text concat, judge surfacing, LaTeX pass-through, file IO, blank-line handling, dst-dir creation, CLI exit codes).
ns eval sentinel tests: backend flag wired; gym path rejects unregistered benchmarks; mixed registered+unregistered rejected.
No regressions in pre-existing test_eval.py.
End-to-end parity run (gsm8k Skills vs Gym, single seed → full benchmark 4-seed mean@4). Not in this PR — needs a cluster slot. This is the next deliverable.

Things deferred

Benchmark resolver: the temporary registry is a placeholder until the upstream Gym auto-resolver lands (a single-benchmark gsm8k entry today).
ns generate migration: separate follow-up; same dispatcher pattern.
Multi-model and chunking on the Gym path: raise NotImplementedError; Skills backend covers both.
Output-schema migration: explicitly NOT in this PR; the adapter keeps the existing summarize_results consumers working unchanged.

Risk

--backend=skills is the default and is untouched. The new code paths are gated behind --backend=gym and refuse to start for any benchmark not in the registry.

Summary by CodeRabbit

New Features
- Choose evaluation backend via CLI (skills vs gym) with benchmark validation and container selection.
- Gym evaluation client that installs, starts servers, collects per-unit rollouts, and can convert rollouts to Skills format.
- CLI tool to convert Gym rollouts into Skills-compatible JSONL (math metric supported).
- Translator that maps Skills-style CLI overrides into Gym-compatible flags and payloads.
- Exposed Gym registry/config utilities for pipeline use.
Tests
- Expanded test coverage for backend behavior, registry, translator, adapter, and Gym client script generation.

…ranslator Scaffolds the path for routing `ns eval` through NeMo-Gym. No behavior change when --backend=skills (the default); --backend=gym raises NotImplementedError until GymEvalClientScript is wired up. - Adds EvalBackend enum + --backend typer flag on ns eval. - Adds nemo_skills/pipeline/utils/gym/translator.py: a pure function that rewrites Skills `++` Hydra overrides into Gym `+` overrides for ng_collect_rollouts. Includes: * Direct renames (++inference.temperature → +responses_create_params.*) * extra_body merging for vLLM-only knobs (top_k, min_p, repetition_penalty, random_seed→seed), using dict-literal syntax to avoid the `extra_body: null` Hydra override gotcha. * Explicit drop list sourced from the migration's manager-review decisions. * Strict-mode option for unknown overrides; default is pass-through + warn so partial migration doesn't silently lose flags. Tests: 43 unit tests for the translator covering renames, extra_body merging, silent drops, server/sandbox drops, rejected overrides, unknown-passthrough vs strict mode, and a realistic gsm8k combo. Two sentinel tests on ns eval itself confirm the flag is wired and gym path raises until step 3 lands. Migration plan: see convert-eval-to-gym/TIERED_PLAN.md (Tier 1 steps 2 & 4). Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

…only) Builds the Gym client script and dispatches to it from `ns eval --backend=gym`. The previous commit added the flag + a NotImplementedError gate; this commit replaces that gate with real wiring. - nemo_skills/pipeline/utils/gym/registry.py: tiny benchmark → (config_paths, agent_name) registry. Only gsm8k registered for v1. Designed to be replaced by the upstream Gym auto-resolver PR once that lands. - nemo_skills/pipeline/utils/scripts/gym_eval.py: GymEvalClientScript, a peer of EvalClientScript. Surface matches (units, servers, sandbox, model_names, server_types, server_addresses_prehosted) plus Gym-specific fields (config_paths, agent_name, gym_path, policy_api_key, policy_model_name). Builds a single shell command: 1. Resolve Gym install path (importable nemo_gym → container fallback). 2. `uv sync --active --extra dev` to ensure venv exists. 3. Wait for the policy server. 4. Start ng_run with config_paths in background; poll ng_status until healthy. 5. Per unit: ng_collect_rollouts with the translator-rewritten extra_arguments; output file name carries -rs{seed}/-chunk{id} suffixes. 6. Clean up ng_run. Reuses NemoGymRolloutsScript's shell scaffolding; sandbox host/port are forwarded via env vars (Q5 decision: Skills keeps owning the sandbox). - Dispatcher in ns eval: * Early pre-check: --backend=gym fails fast for benchmarks not in the registry, with a clear pointer to --backend=skills as the fallback. * One-benchmark-per-job guard: a single SLURM job runs one shared ng_run mesh, so we refuse to mix benchmarks (and tell the user how to fix it). * Multi-model unsupported on the Gym path (Skills v1 keeps the multi-model eval path; raises NotImplementedError if requested). Tests: 12 new GymEvalClientScript tests (registry, validation, shell-emit shape, per-seed/per-chunk output paths, translator pipeline, server hostname ref, sandbox env, multi-unit single-ng_run sharing). Two ns eval sentinel tests replace the prior NotImplementedError check with a "rejects unregistered benchmark" check. Tier 1 step 3 (TIERED_PLAN.md) complete. Remaining: step 5 (metrics adapter) and step 6 (parity run on a cluster). Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

Wraps Tier 1 by closing the metrics gap: after each `ng_collect_rollouts`, convert Gym's `rollouts.jsonl` into a Skills-shape `output.jsonl` so the existing `summarize_results` task produces the right `metrics.json` without changes. The dispatcher passes the benchmark's `METRICS_TYPE` through to the client script; only "math" is implemented for v1 (gsm8k pilot). - nemo_skills/pipeline/adapters/gym_to_skills.py: pure converter + file-level wrapper + argparse CLI suitable for shell invocation. Maps math_with_judge rollouts to {generation, predicted_answer, expected_answer, symbolic_correct} (plus an opt-in `judgement` placeholder for Tier 2 judge benchmarks). Unsupported metric_types raise NotImplementedError with a clear pointer. - GymEvalClientScript: new optional `metric_type` field. When set, each per-unit shell block runs the converter immediately after ng_collect_rollouts succeeds, writing the Skills-shape output.jsonl next to the rollouts.jsonl (with matching -rs{seed}/-chunk{id} suffixes). When unset, the post-rollout step is skipped — keeps the script useful for non-metric flows. - ns eval dispatcher: passes `benchmarks_dict[benchmark].metrics_type` (or the CLI `--metric_type` override) through to GymEvalClientScript. Tests: 15 new converter tests (correct/incorrect mapping, missing library_reward fallback, multi-chunk output_text concat, judge surfacing, LaTeX expected-answer pass-through, file IO, blank-line handling, dst-dir creation, CLI exit codes, unknown-type rejection). Three new GymEvalClientScript tests verify the converter is wired in (only when metric_type is set; produces -rs{seed}-suffixed Skills outputs; one call per unit). Tier 1 step 5 (TIERED_PLAN.md) complete. Remaining: step 6 (parity run gsm8k Skills vs Gym). Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

The Gym client task runs `ng_run` + `ng_collect_rollouts`, which live in the gym container — the `nemo-skills` container doesn't have them. Without this the parity run on real clusters would fail at first invocation. Same container precedence as `nemo_gym_rollouts`: explicit `--main_container` beats everything; otherwise `containers['nemo-gym']` if present, else `containers['nemo-rl']`. `--backend=skills` is unchanged. Tests: - gym backend with only nemo-rl falls back to it. - gym backend with nemo-gym key prefers it. - skills backend still picks nemo-skills (regression guard). Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

coderabbitai · 2026-05-18T19:43:24Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds Gym backend support: benchmark registry and translator, GymEvalClientScript orchestration, Gym→Skills rollout converter and CLI, eval CLI backend selection, and comprehensive tests covering registry, translator, converter, script output, and eval wiring.

Changes

Gym Backend Integration

Layer / File(s)	Summary
Gym→Skills rollout converter `nemo_skills/pipeline/adapters/*`, `tests/test_gym_to_skills_adapter.py`	New adapter module converts Gym rollouts.jsonl to Skills-shaped output.jsonl (math metric), includes streaming file conversion, CLI, and unit/file/CLI tests.
Gym utilities, registry, and translator `nemo_skills/pipeline/utils/gym/*`, `tests/test_gym_translator.py`	Adds gym utils package entrypoint, immutable `GymBenchmarkConfig` registry and lookup helpers, and `translate_skills_overrides_to_gym` with rejection/drop rules and tests.
Gym eval orchestration script `nemo_skills/pipeline/utils/scripts/gym_eval.py`, `nemo_skills/pipeline/utils/scripts/__init__.py`, `tests/test_gym_eval_script.py`	Adds `GymEvalClientScript` building an inline shell scaffold to run `ng_run`/`ng_collect_rollouts`, translate overrides, optionally convert rollouts, and includes many tests validating emitted shell fragments and behavior.
Eval CLI/backend wiring `nemo_skills/pipeline/eval.py`, `tests/test_eval.py`	Adds `EvalBackend` enum and `--backend` option, validates requested gym benchmarks, constructs `GymEvalClientScript` for gym backend (single benchmark per job), and adjusts client container selection; tests for backend behavior and container selection.

Sequence Diagram(s)

sequenceDiagram
  participant Shell
  participant ng_run
  participant ng_status
  participant ng_collect
  participant Converter
  Shell->>ng_run: start background (config_paths, policy params)
  Shell->>ng_status: poll until healthy
  loop units
    Shell->>ng_collect: run with agent_name, input/output, translated overrides
    ng_collect-->>Shell: rollouts.jsonl
    alt metric_type set
      Shell->>Converter: python -m nemo_skills.pipeline.adapters.gym_to_skills --metric_type=...
      Converter-->>Shell: skills-shaped output.jsonl
    end
  end
  Shell->>ng_run: terminate

Estimated code review effort:
🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers:

AlexGrinch
naymaraq

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.62% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and specifically summarizes the main change: adding a Gym backend option for the eval command, with the pilot scope clearly limited to gsm8k via the parenthetical.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch georgea/convert-eval-to-gym

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

tests/test_gym_to_skills_adapter.py (1)

63-123: ⚡ Quick win

Add contract tests for missing required rollout keys

Please add negative tests asserting KeyError for missing required keys (response, extracted_answer, expected_answer, and reward when library_reward is absent`) so schema drift fails loudly.

Suggested tests

 class TestConvertMathRollout:
+    def test_missing_expected_answer_raises(self):
+        rollout = _gym_rollout()
+        del rollout["expected_answer"]
+        with pytest.raises(KeyError):
+            convert_math_rollout(rollout)
+
+    def test_missing_response_raises(self):
+        rollout = _gym_rollout()
+        del rollout["response"]
+        with pytest.raises(KeyError):
+            convert_math_rollout(rollout)

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_gym_to_skills_adapter.py` around lines 63 - 123, Add negative
contract tests to tests/test_gym_to_skills_adapter.py that call
convert_math_rollout with rollouts missing required keys and assert they raise
KeyError: create variants from _gym_rollout then del rollout["response"], del
rollout["extracted_answer"], del rollout["expected_answer"], and a variant where
you remove "library_reward" and also del rollout["reward"] to verify missing
reward raises KeyError; wrap each call in pytest.raises(KeyError) so
convert_math_rollout fails loudly rather than using .get(), referencing
convert_math_rollout and _gym_rollout to locate the test targets.

tests/test_gym_translator.py (1)

202-208: ⚡ Quick win

Add an assertion for quoted passthrough token integrity.

This test currently misses the critical expectation that ++some_string="hello world" remains a single effective argument after translation.

Suggested assertion update

@@
     def test_quoted_values_survive(self):
@@
         result = translate_skills_overrides_to_gym('++inference.temperature=0.7 ++some_string="hello world"')
@@
         assert "+responses_create_params.temperature=0.7" in result
+        assert "some_string" in result
+        assert "hello world" in result

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_gym_translator.py` around lines 202 - 208, The test
test_quoted_values_survive is missing an assertion that the quoted token stays a
single argument; update the test to assert that
translate_skills_overrides_to_gym preserves the quoted passthrough as a single
token by checking for the expected output fragment corresponding to
++some_string="hello world" (use the translate_skills_overrides_to_gym call and
assert the resulting string contains the single-token form for that
passthrough), referencing the test function name and
translate_skills_overrides_to_gym to locate and modify the assertion.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/pipeline/adapters/gym_to_skills.py`:
- Around line 45-54: The converter is silently tolerating missing rollout-schema
keys by using dict.get, so update the code to fail fast: in _extract_output_text
and the related extractor around lines referenced (handling response,
extracted_answer, expected_answer, reward fallback), replace uses of
response.get(...) and similar .get() calls with direct key access (e.g.,
response["output"], item["content"], c["type"], c["text"]) and add explicit,
descriptive exceptions (KeyError or ValueError) when required keys are absent so
schema drift surfaces immediately; ensure any fallback logic for reward is
removed or replaced with a clear error and include the symbol names
(_extract_output_text and the other extractor functions handling
extracted_answer/expected_answer) in your changes.
- Around line 133-151: The convert_file function currently opens output_path for
writing before conversion, risking partial/truncated output on errors; change
convert_file so it first fully reads and converts all rollouts by iterating
convert_rollouts(rollouts, metric_type=metric_type) into an in-memory list (or
stream to a temp file in the same directory), only then open output_path for
writing (or atomically replace it by writing to a temp file and os.replace into
output_path). Ensure you preserve the returned row count n and the LOG.info
message, and reference convert_rollouts, rollouts_path, output_path, and n when
locating where to implement the two-phase convert-then-write behavior.

In `@nemo_skills/pipeline/utils/gym/translator.py`:
- Around line 139-197: The returned command string is built with "
".join(passthrough_parts + gym_parts) which loses shell quoting and splits
tokens containing spaces; replace that final join with
shlex.join(passthrough_parts + gym_parts) (ensure shlex is imported/available)
so tokens like values merged into extra_body remain properly shell-quoted;
update the return expression in the function that builds
passthrough_parts/gym_parts/extra_body (the block that currently ends with
return " ".join(...).strip()) to use shlex.join(...).strip() instead.

---

Nitpick comments:
In `@tests/test_gym_to_skills_adapter.py`:
- Around line 63-123: Add negative contract tests to
tests/test_gym_to_skills_adapter.py that call convert_math_rollout with rollouts
missing required keys and assert they raise KeyError: create variants from
_gym_rollout then del rollout["response"], del rollout["extracted_answer"], del
rollout["expected_answer"], and a variant where you remove "library_reward" and
also del rollout["reward"] to verify missing reward raises KeyError; wrap each
call in pytest.raises(KeyError) so convert_math_rollout fails loudly rather than
using .get(), referencing convert_math_rollout and _gym_rollout to locate the
test targets.

In `@tests/test_gym_translator.py`:
- Around line 202-208: The test test_quoted_values_survive is missing an
assertion that the quoted token stays a single argument; update the test to
assert that translate_skills_overrides_to_gym preserves the quoted passthrough
as a single token by checking for the expected output fragment corresponding to
++some_string="hello world" (use the translate_skills_overrides_to_gym call and
assert the resulting string contains the single-token form for that
passthrough), referencing the test function name and
translate_skills_overrides_to_gym to locate and modify the assertion.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a92f9211-53e0-4a58-9768-930c74ebd2d7

📥 Commits

Reviewing files that changed from the base of the PR and between 467b057 and 4bd1542.

📒 Files selected for processing (12)

nemo_skills/pipeline/adapters/__init__.py
nemo_skills/pipeline/adapters/gym_to_skills.py
nemo_skills/pipeline/eval.py
nemo_skills/pipeline/utils/gym/__init__.py
nemo_skills/pipeline/utils/gym/registry.py
nemo_skills/pipeline/utils/gym/translator.py
nemo_skills/pipeline/utils/scripts/__init__.py
nemo_skills/pipeline/utils/scripts/gym_eval.py
tests/test_eval.py
tests/test_gym_eval_script.py
tests/test_gym_to_skills_adapter.py
tests/test_gym_translator.py

coderabbitai · 2026-05-18T19:43:27Z

+def _extract_output_text(response: Dict[str, Any]) -> str:
+    """Pull the assistant-message text out of an OpenAI Responses-API response."""
+    chunks: List[str] = []
+    for item in response.get("output", []) or []:
+        for c in item.get("content", []) or []:
+            if c.get("type") == "output_text":
+                text = c.get("text")
+                if text:
+                    chunks.append(text)
+    return "".join(chunks)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on required rollout schema keys

This converter silently degrades required fields (response, extracted_answer, expected_answer, and reward fallback), which can emit corrupted Skills rows instead of surfacing schema drift immediately.

Proposed fix

def _extract_output_text(response: Dict[str, Any]) -> str: """Pull the assistant-message text out of an OpenAI Responses-API response.""" chunks: List[str] = [] - for item in response.get("output", []) or []: - for c in item.get("content", []) or []: - if c.get("type") == "output_text": - text = c.get("text") + for item in response["output"]: + for c in item["content"]: + if c["type"] == "output_text": + text = c["text"] if text: chunks.append(text) return "".join(chunks) @@ - library_reward = rollout.get("library_reward") - if library_reward is None: + library_reward = rollout["library_reward"] if "library_reward" in rollout else rollout["reward"] + if library_reward is None: # Fall back to the overall reward when the server doesn't expose the # symbolic-only signal (we lose the judge/symbolic separation but at # least we get a pass/fail). - library_reward = rollout.get("reward", 0.0) + library_reward = rollout["reward"] @@ out: Dict[str, Any] = { - "generation": _extract_output_text(rollout.get("response", {}) or {}), - "predicted_answer": rollout.get("extracted_answer"), - "expected_answer": rollout.get("expected_answer"), + "generation": _extract_output_text(rollout["response"]), + "predicted_answer": rollout["extracted_answer"], + "expected_answer": rollout["expected_answer"], "symbolic_correct": bool(library_reward == 1.0), }

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 70-81

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/adapters/gym_to_skills.py` around lines 45 - 54, The converter is silently tolerating missing rollout-schema keys by using dict.get, so update the code to fail fast: in _extract_output_text and the related extractor around lines referenced (handling response, extracted_answer, expected_answer, reward fallback), replace uses of response.get(...) and similar .get() calls with direct key access (e.g., response["output"], item["content"], c["type"], c["text"]) and add explicit, descriptive exceptions (KeyError or ValueError) when required keys are absent so schema drift surfaces immediately; ensure any fallback logic for reward is removed or replaced with a clear error and include the symbol names (_extract_output_text and the other extractor functions handling extracted_answer/expected_answer) in your changes.

coderabbitai · 2026-05-18T19:43:27Z

+def convert_file(
+    rollouts_path: str | Path,
+    output_path: str | Path,
+    *,
+    metric_type: str,
+) -> int:
+    """Read `rollouts_path`, write Skills-shape JSONL to `output_path`. Returns row count."""
+    rollouts_path = Path(rollouts_path)
+    output_path = Path(output_path)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+
+    n = 0
+    with rollouts_path.open("r", encoding="utf-8") as fin, output_path.open("w", encoding="utf-8") as fout:
+        rollouts = (json.loads(line) for line in fin if line.strip())
+        for prediction in convert_rollouts(rollouts, metric_type=metric_type):
+            fout.write(json.dumps(prediction))
+            fout.write("\n")
+            n += 1
+    LOG.info("Wrote %d converted rows from %s to %s", n, rollouts_path, output_path)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Protect output files from partial writes on conversion failures

convert_file truncates/writes output_path while parsing/converting is still running. If a malformed JSON line or schema error occurs, callers can be left with partial output.

Proposed fix

+import tempfile @@ def convert_file( @@ ) -> int: @@ - n = 0 - with rollouts_path.open("r", encoding="utf-8") as fin, output_path.open("w", encoding="utf-8") as fout: - rollouts = (json.loads(line) for line in fin if line.strip()) - for prediction in convert_rollouts(rollouts, metric_type=metric_type): - fout.write(json.dumps(prediction)) - fout.write("\n") - n += 1 + n = 0 + with rollouts_path.open("r", encoding="utf-8") as fin: + rollouts = (json.loads(line) for line in fin if line.strip()) + converted = list(convert_rollouts(rollouts, metric_type=metric_type)) + + with tempfile.NamedTemporaryFile("w", encoding="utf-8", dir=output_path.parent, delete=False) as tmp: + tmp_path = Path(tmp.name) + for prediction in converted: + tmp.write(json.dumps(prediction)) + tmp.write("\n") + n += 1 + + tmp_path.replace(output_path) LOG.info("Wrote %d converted rows from %s to %s", n, rollouts_path, output_path) return n

As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/adapters/gym_to_skills.py` around lines 133 - 151, The convert_file function currently opens output_path for writing before conversion, risking partial/truncated output on errors; change convert_file so it first fully reads and converts all rollouts by iterating convert_rollouts(rollouts, metric_type=metric_type) into an in-memory list (or stream to a temp file in the same directory), only then open output_path for writing (or atomically replace it by writing to a temp file and os.replace into output_path). Ensure you preserve the returned row count n and the LOG.info message, and reference convert_rollouts, rollouts_path, output_path, and n when locating where to implement the two-phase convert-then-write behavior.

coderabbitai · 2026-05-18T19:43:27Z

+    for token in shlex.split(extra_arguments):
+        key, value, prefix = _parse_token(token)
+
+        # Non-override token (e.g. `--config-name foo`): pass through unchanged.
+        if prefix is None:
+            passthrough_parts.append(token)
+            continue
+
+        if key in _REJECTED:
+            raise UnsupportedSkillsOverrideError(f"{key}: {_REJECTED[key]}")
+
+        if key in _SILENTLY_DROPPED:
+            LOG.debug("Dropping Skills override %s (%s)", key, _SILENTLY_DROPPED[key])
+            continue
+
+        if any(key.startswith(p) for p in _DROPPED_PREFIXES):
+            matched_prefix = next(p for p in _DROPPED_PREFIXES if key.startswith(p))
+            LOG.debug("Dropping Skills override %s (%s)", key, _DROPPED_PREFIXES[matched_prefix])
+            continue
+
+        if key in _EXTRA_BODY_KEYS:
+            extra_body[_EXTRA_BODY_KEYS[key]] = value
+            continue
+
+        if key in _DIRECT_RENAMES:
+            gym_parts.append(f"+{_DIRECT_RENAMES[key]}={value}")
+            continue
+
+        # `++inference.extra_body.foo=bar` → merge into extra_body dict.
+        if key.startswith("inference.extra_body."):
+            sub_key = key[len("inference.extra_body.") :]
+            extra_body[sub_key] = value
+            continue
+
+        # Server / sandbox overrides are handled by the pipeline layer
+        # (model-server YAML assembly), not by ng_collect_rollouts. Drop them
+        # here so they don't end up as bogus Gym flags.
+        if key.startswith("server.") or key.startswith("sandbox."):
+            LOG.debug("Dropping Skills override %s (handled by pipeline layer)", key)
+            continue
+
+        if strict_unknown:
+            raise UnsupportedSkillsOverrideError(
+                f"No Gym mapping for Skills override `++{key}`. "
+                f"Add it to nemo_skills/pipeline/utils/gym/translator.py "
+                f"or use --backend=skills."
+            )
+        LOG.warning("Unmapped Skills override `++%s` passed through unchanged", key)
+        passthrough_parts.append(token)
+
+    if extra_body:
+        # Dict-literal syntax to dodge the `extra_body: null` Hydra gotcha
+        # (vllm_model.yaml ships extra_body: null, which silently no-ops
+        # `++…extra_body.foo=bar`). See memory/feedback_hydra_extra_body_override.md.
+        body_str = ", ".join(f"{k}: {v}" for k, v in sorted(extra_body.items()))
+        gym_parts.append(f"+responses_create_params.extra_body={{{body_str}}}")
+
+    return " ".join(passthrough_parts + gym_parts).strip()
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and read the relevant section cat -n nemo_skills/pipeline/utils/gym/translator.py | sed -n '130,205p'

Repository: NVIDIA-NeMo/Skills

Length of output: 3631

🏁 Script executed:

# Also check the function signature to understand what this is part of cat -n nemo_skills/pipeline/utils/gym/translator.py | sed -n '120,205p'

Repository: NVIDIA-NeMo/Skills

Length of output: 4138

🏁 Script executed:

# Search for usages of this function's return value to understand how output is consumed rg "translate_skills_args|_translate_overrides" nemo_skills/ -A 5 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 44

🏁 Script executed:

# Check the function signature and docstring to understand return value requirements cat -n nemo_skills/pipeline/utils/gym/translator.py | sed -n '100,130p'

Repository: NVIDIA-NeMo/Skills

Length of output: 1454

🏁 Script executed:

# Check what Python version this project targets find . -name "setup.py" -o -name "pyproject.toml" -o -name ".python-version" | head -5 | xargs cat 2>/dev/null | grep -A 5 -B 5 "python"

Repository: NVIDIA-NeMo/Skills

Length of output: 1627

🏁 Script executed:

# Search for actual usage of the translate_skills_args function in shell contexts rg "translate_skills_args\(" nemo_skills/ -B 3 -A 3

Repository: NVIDIA-NeMo/Skills

Length of output: 44

🏁 Script executed:

# Search for actual usage and how the return value is consumed rg "translate_skills_overrides_to_gym" --type py -B 2 -A 5

Repository: NVIDIA-NeMo/Skills

Length of output: 13733

🏁 Script executed:

# Check if shlex.join is used elsewhere in the codebase rg "shlex\.join" nemo_skills/

Repository: NVIDIA-NeMo/Skills

Length of output: 395

🏁 Script executed:

# Look for subprocess calls or shell command construction that might use this function's output rg "ng_collect_rollouts" --type py -B 5 -A 5

Repository: NVIDIA-NeMo/Skills

Length of output: 21763

🏁 Script executed:

# Verify the specific issue: check how gym_eval.py constructs the full command cat -n nemo_skills/pipeline/utils/scripts/gym_eval.py | sed -n '245,290p'

Repository: NVIDIA-NeMo/Skills

Length of output: 2275

🏁 Script executed:

# Look at a concrete scenario: what happens with values containing spaces in extra_body python3 <<'EOF' import shlex # Simulate the scenario key = "foo" value = "hello world" # Value with space # Current approach (what the code does) gym_parts = [] extra_body = {} extra_body["foo"] = "hello world" body_str = ", ".join(f"{k}: {v}" for k, v in sorted(extra_body.items())) gym_parts.append(f"+responses_create_params.extra_body={{{body_str}}}") result_current = " ".join(gym_parts).strip() print(f"Current result: {result_current}") print(f"If re-parsed by shell, becomes tokens: {shlex.split(result_current)}") # With shlex.join result_proper = shlex.join(gym_parts).strip() print(f"\nWith shlex.join: {result_proper}") print(f"If re-parsed by shell, becomes tokens: {shlex.split(result_proper)}") EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 410

Preserve shell token boundaries when rebuilding translated args.

Line 196 uses plain string join after shlex.split(), which fails to quote tokens containing spaces. When values are merged into the extra_body dict (e.g., "hello world"), the resulting string gets split into multiple tokens when parsed as shell arguments rather than a single argument. The docstring promises a "shell-quoted string" but the implementation doesn't deliver it.

Use shlex.join() instead, which is available in Python 3.8+ (project requires >=3.10) and already used elsewhere in the codebase:

- return " ".join(passthrough_parts + gym_parts).strip() + return shlex.join(passthrough_parts + gym_parts).strip()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/utils/gym/translator.py` around lines 139 - 197, The returned command string is built with " ".join(passthrough_parts + gym_parts) which loses shell quoting and splits tokens containing spaces; replace that final join with shlex.join(passthrough_parts + gym_parts) (ensure shlex is imported/available) so tokens like values merged into extra_body remain properly shell-quoted; update the return expression in the function that builds passthrough_parts/gym_parts/extra_body (the block that currently ends with return " ".join(...).strip()) to use shlex.join(...).strip() instead.

coderabbitai

🧹 Nitpick comments (1)

tests/test_eval.py (1)

349-598: 💤 Low value

Consider extracting shared setup into a fixture.

The three container selection tests share substantial setup code (cluster_config, benchmark_args, unit creation, monkeypatches). A parameterized fixture or helper function could reduce duplication and make the test intent clearer. That said, the current implementation is readable and self-contained.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_eval.py` around lines 349 - 598, Extract the repeated setup in
test_gym_backend_picks_gym_container_falls_back_to_nemo_rl,
test_gym_backend_prefers_nemo_gym_container_when_present, and
test_skills_backend_keeps_using_nemo_skills_container into a shared pytest
fixture or helper function: move creation of cluster_config, benchmark_args,
unit, fake_command_init, and the repeated monkeypatch.setattr calls for
eval_pipeline.pipeline_utils (get_cluster_config, resolve_mount_paths,
get_env_variables, check_mounts, get_exp, run_exp), prepare_eval_commands, and
Pipeline.run into that fixture/helper and have each test call it (or parametrize
container mappings and backend); keep assertions (captured["client_container"])
and calls to eval_pipeline.eval in the individual tests and reference the same
symbols (cluster_config, benchmark_args, unit, fake_command_init,
prepare_eval_commands, eval_pipeline.eval) so behavior remains identical.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/test_eval.py`:
- Around line 349-598: Extract the repeated setup in
test_gym_backend_picks_gym_container_falls_back_to_nemo_rl,
test_gym_backend_prefers_nemo_gym_container_when_present, and
test_skills_backend_keeps_using_nemo_skills_container into a shared pytest
fixture or helper function: move creation of cluster_config, benchmark_args,
unit, fake_command_init, and the repeated monkeypatch.setattr calls for
eval_pipeline.pipeline_utils (get_cluster_config, resolve_mount_paths,
get_env_variables, check_mounts, get_exp, run_exp), prepare_eval_commands, and
Pipeline.run into that fixture/helper and have each test call it (or parametrize
container mappings and backend); keep assertions (captured["client_container"])
and calls to eval_pipeline.eval in the individual tests and reference the same
symbols (cluster_config, benchmark_args, unit, fake_command_init,
prepare_eval_commands, eval_pipeline.eval) so behavior remains identical.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bba08dfd-6a14-48fb-bd19-a13a4aabf801

📥 Commits

Reviewing files that changed from the base of the PR and between 4bd1542 and b2437b5.

📒 Files selected for processing (2)

nemo_skills/pipeline/eval.py
tests/test_eval.py

Skills and Gym ship per-benchmark input JSONLs with different schemas — for gsm8k, Skills uses `problem` and Gym uses `question`, so feeding Skills' test.jsonl into ng_collect_rollouts via the resource server's `apply_prompt_to_row` produces empty prompts. The unit tests passed because they never exercised the schema dependency. - Add `input_jsonl_fpath` to GymBenchmarkConfig (required). - New `gym_input_jsonl_fpath` field on GymEvalClientScript: when set, the ng_collect_rollouts input file is the resolved `$GYM_PATH/<rel-path>` at runtime instead of the unit's Skills-side input. - Dispatcher in ns eval threads the registry value through. Tests: override forces the Gym path; absence falls back to the unit's input_file (kept for back-compat / non-benchmark uses). Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

Skills' `++prompt_config=generic/math` is a path under nemo_skills/prompt/config/. The translator was passing it through as an unknown override; Gym's `ng_collect_rollouts` accepts a `prompt_config` field but interprets it as a literal file path under the Gym install dir, so the smoke run crashed with `FileNotFoundError: /opt/Gym/generic/math`. In Gym the prompt config lives in the agent yaml's `datasets[]` entry — nothing should override it via CLI. Add prompt_config and prompt_template (used by some judge benchmarks) to the silent-drop list. Tests: parametrised drop coverage, plus the gsm8k realistic-combo regression test now asserts no `prompt_config` leaks through. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

coderabbitai

♻️ Duplicate comments (1)

nemo_skills/pipeline/utils/gym/translator.py (1)

145-202: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use shlex.join to preserve argument boundaries.

Line 202 rebuilds argv with plain " ".join(...), which drops shell quoting and can corrupt overrides containing spaces.

Proposed fix

-    return " ".join(passthrough_parts + gym_parts).strip()
+    return shlex.join(passthrough_parts + gym_parts).strip()

#!/bin/bash
# Verify current implementation and reproduce token-boundary breakage.
set -euo pipefail

echo "1) Confirm return expression:"
nl -ba nemo_skills/pipeline/utils/gym/translator.py | sed -n '190,206p'

echo
echo "2) Demonstrate why plain join is lossy:"
python - <<'PY'
import shlex
parts = ["+responses_create_params.extra_body={note: hello world}", "--flag", "x"]
plain = " ".join(parts)
safe = shlex.join(parts)
print("plain :", plain)
print("split :", shlex.split(plain))
print("safe  :", safe)
print("split :", shlex.split(safe))
assert shlex.split(safe) == parts
PY

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/utils/gym/translator.py` around lines 145 - 202, The
return value is rebuilt with " ".join(passthrough_parts + gym_parts) which loses
shell quoting and can break tokens with spaces; change that final join to use
shlex.join(passthrough_parts + gym_parts) so argument boundaries are preserved
(and add/import shlex at top if not already imported); keep the same list order
(passthrough_parts then gym_parts) and ensure the existing variables
passthrough_parts and gym_parts are used unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@nemo_skills/pipeline/utils/gym/translator.py`:
- Around line 145-202: The return value is rebuilt with "
".join(passthrough_parts + gym_parts) which loses shell quoting and can break
tokens with spaces; change that final join to use shlex.join(passthrough_parts +
gym_parts) so argument boundaries are preserved (and add/import shlex at top if
not already imported); keep the same list order (passthrough_parts then
gym_parts) and ensure the existing variables passthrough_parts and gym_parts are
used unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b447759-28d9-480c-936f-eb997cf6986d

📥 Commits

Reviewing files that changed from the base of the PR and between 85d1b1c and 53a2530.

📒 Files selected for processing (2)

nemo_skills/pipeline/utils/gym/translator.py
tests/test_gym_translator.py

🚧 Files skipped from review as they are similar to previous changes (1)

tests/test_gym_translator.py

The Skills-derived input JSONL carries raw fields (question/expected_answer) rather than pre-built responses_create_params, so ng_collect_rollouts has to render the prompt itself. The Gym agent yaml lists the prompt path inside `datasets[].prompt_config`, but the CLI doesn't read it from there — it has to be passed as +prompt_config explicitly. - Add `prompt_config` field (required) to GymBenchmarkConfig; populate it for gsm8k from the agent yaml's value. - New optional `gym_prompt_config` on GymEvalClientScript: when set, injects `+prompt_config="$GYM_PATH"/<rel-path>` so the path resolves against the same Gym dir as everything else at runtime. - Dispatcher in ns eval threads the registry value through. Without this fix the rollout crashes with `KeyError: 'responses_create_params'` during input preprocessing (rollout_collection.py:216). Tests: injection when set; absent when not. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 570bec36-f234-4610-beef-debb514494bd

📥 Commits

Reviewing files that changed from the base of the PR and between 53a2530 and 57c9f71.

📒 Files selected for processing (4)

nemo_skills/pipeline/eval.py
nemo_skills/pipeline/utils/gym/registry.py
nemo_skills/pipeline/utils/scripts/gym_eval.py
tests/test_gym_eval_script.py

🚧 Files skipped from review as they are similar to previous changes (3)

nemo_skills/pipeline/utils/gym/registry.py
tests/test_gym_eval_script.py
nemo_skills/pipeline/utils/scripts/gym_eval.py

coderabbitai · 2026-05-18T20:35:14Z

+            if backend == EvalBackend.gym.value:
+                # Gym runs one shared `ng_run` mesh per SLURM job, so all units
+                # in a job must share the same Gym wiring. Enforce one benchmark
+                # per job rather than per-unit mesh-switching.
+                if len(job_benchmarks) > 1:
+                    raise ValueError(
+                        f"--backend=gym requires one benchmark per SLURM job; this job "
+                        f"would mix {sorted(job_benchmarks)}. Re-run with --num_jobs=-1 "
+                        f"(or large enough that each benchmark gets its own job)."
+                    )
+                (gym_benchmark,) = job_benchmarks
+                gym_cfg: GymBenchmarkConfig = get_gym_config(gym_benchmark)
+                # Effective metric_type for the gym→skills converter: prefer
+                # the CLI override, else the benchmark's METRICS_TYPE module
+                # constant. Same precedence summarize_results uses.
+                gym_metric_type = metric_type or benchmarks_dict[gym_benchmark].metrics_type
+                client_script = GymEvalClientScript(
+                    units=unit_dicts,
+                    config_paths=list(gym_cfg.config_paths),
+                    agent_name=gym_cfg.agent_name,
+                    gym_input_jsonl_fpath=gym_cfg.input_jsonl_fpath,
+                    gym_prompt_config=gym_cfg.prompt_config,
+                    metric_type=gym_metric_type,
+                    single_node_mode=single_node_mode,
+                    with_sandbox=sandbox_enabled,
+                    servers=server_scripts,
+                    server_addresses_prehosted=server_addresses_list,
+                    model_names=models_list,
+                    server_types=server_types_list,
+                    sandbox=sandbox_script,
+                    installation_command=installation_command,
+                )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject unsupported Gym modes here.

Lines 619-650 only enforce “one benchmark per job”. --backend=gym still accepts multi-model inputs and chunked eval units, even though this path is scoped to a single model without chunking. Please fail fast before constructing GymEvalClientScript; otherwise those user flags are accepted and the failure moves downstream.

Suggested guard

if backend == EvalBackend.gym.value: + if num_models != 1: + raise NotImplementedError( + "--backend=gym currently supports exactly one model per evaluation job." + ) + if any(unit["chunk_id"] is not None or (unit["num_chunks"] or 1) > 1 for unit in unit_dicts): + raise NotImplementedError( + "--backend=gym does not support --num_chunks/--chunk_ids yet." + ) # Gym runs one shared `ng_run` mesh per SLURM job, so all units # in a job must share the same Gym wiring. Enforce one benchmark # per job rather than per-unit mesh-switching.

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

ng_collect_rollouts writes a sidecar `<stem>_materialized_inputs.jsonl` next to its output. The parent directory has to exist before that, but neither nemo_run nor the GymEvalClientScript was creating it, so the smoke run hit: FileNotFoundError: [Errno 2] No such file or directory: '<output_dir>/rollouts_materialized_inputs.jsonl' Mirrors NemoGymRolloutsScript which does the same mkdir. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

The post-rollout converter runs inside the Gym container, which only has `nemo-skills-core` installed (no `nemo_run`). The adapter previously lived under `nemo_skills.pipeline.adapters`; importing it triggered the pipeline package's `nemo_run` guard: ImportError: nemo-skills-core is installed but pipeline functionality requires the full package. Move to `nemo_skills.adapters` (top-level). The adapter has no pipeline dependencies, so this is purely about being reachable from the core install. Updates the `python -m` invocation in GymEvalClientScript and all test imports / shell-emit assertions. Surfaced by the smoke run: the rollout itself completed at 90% symbolic accuracy on 10 problems; only the post-step conversion failed to start. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

Hydra accepts `+responses_create_params.extra_body={seed: 0}` as a single override, but the space inside the dict makes bash word-split it into two args. Hydra then sees `+…extra_body={seed:` and `0}` separately and dies with `no viable alternative at input '{seed:'`. Surfaced live on a 2-seed parity run (seed=0/1 → extra_body emitted; the smoke didn't trip it because no extra_body was generated when random_seed=None). Wrap both call sites in double quotes: - Translator (when min_p/top_k/repetition_penalty/random_seed produce extra_body, plus any explicit `++inference.extra_body.foo=bar`). - GymEvalClientScript (when per-unit random_seed surfaces as `extra_body={seed: N}`). Tests: existing extra_body assertions updated to expect the quoted form; new regression test for the per-seed shell-quoting case. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

…forbid' I'd been routing Skills' vLLM-only knobs (top_k, min_p, repetition_penalty, random_seed) and the per-job random_seed through `+responses_create_params.extra_body={…}`. That's wrong: Gym's `NeMoGymResponseCreateParamsNonStreaming` is `extra="forbid"` and has no `extra_body` field on the per-call schema. The 2-seed parity run on the full benchmark surfaced this — every rollout 422'd with: "type": "extra_forbidden", "loc": ["body", "responses_create_params", "extra_body"] (The smoke didn't trip it because no extra_body was emitted when random_seed=None and the recipe didn't include any vLLM-only knobs.) The `extra_body` Hydra-override memory was about the *model-server* config field (`vllm_model.extra_body`) which is static / per-server, not per-call. The right home for these knobs is overriding the model-server config at ng_run time, not threading per-call args through ng_collect_rollouts. Changes: - Translator: vLLM-only knobs become silent-with-warning drops. Document the proper home in the warning text. Same for ++inference.extra_body.*. - GymEvalClientScript: stop injecting `extra_body={seed: N}` per unit. Per-seed reproducibility is a known gap — Skills' two seeds and Gym's two seeds will produce uncorrelated variance, but mean@N statistics still compare. Tests updated accordingly + regression assertions that `extra_body`/`seed:` never leak into the emitted shell command. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

Scope correction: the migration intentionally breaks the Skills `output.jsonl` / `metrics.json` contract. Gym's native `rollouts.jsonl` + `rollouts_aggregate_metrics.json` are the canonical outputs on the `--backend=gym` path. Removes: - `nemo_skills/adapters/gym_to_skills.py` and its package (whole adapter). - `tests/test_gym_to_skills_adapter.py` (18 tests). - `metric_type` field, `_convert_cmd` method, and the `_skills_output_jsonl_for_unit` helper on `GymEvalClientScript`. The per-unit shell block is now just ng_collect_rollouts (no post-step). - The dispatcher's `gym_metric_type` plumbing. Changes: - `ns eval` skips the `summarize_results` SLURM task on the gym path — Gym already writes its aggregate metrics next to the rollouts JSONL, and `summarize_results` expects a Skills schema we no longer produce. Tests: - Replace the converter-wiring tests with a single negative-regression ("no skills schema artifacts in the emitted shell") on GymEvalClientScript. - New `test_eval.py::test_gym_backend_skips_summarize_results_task` asserts the dispatcher doesn't schedule the dependent summarize job for `--backend=gym`. The 2.46pp pass@1 gap reported in the bring-up parity run was an artefact of Skills' MathMetrics processing Gym's lenient `extracted_answer`. With this change we no longer feed Gym data through Skills' MathMetrics, so that gap doesn't apply. 83 tests pass. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

All four wrap `math_with_judge` with a per-benchmark agent of the same shape as gsm8k. Factor `_math_with_judge_entry()` so each registration is one line plus an optional prompt_config override (aime/hmmt ship their own per-benchmark prompt YAML; gsm8k and hendrycks_math share the shared `benchmarks/prompts/generic_math.yaml`). Tests: parametrized check that each entry resolves to the right config_paths / agent_name / input_jsonl_fpath / prompt_config — guards against typos in the path layout. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

Upstream Gym restructured `benchmarks/prompts/`: the flat `generic_math.yaml`, `math_answer_judge.yaml`, etc. moved to sub-directories (`generic/math.yaml`, `judge/math.yaml`, …). The per-benchmark prompt YAMLs we'd been pointing aime24/aime25/hmmt_feb25 at (`benchmarks/<bench>/prompts/default.yaml`) were also removed — every math_with_judge benchmark now uses the shared `benchmarks/prompts/generic/math.yaml`. Update all five registry entries to the new path, simplify the helper's default to match, and tighten the parametrized registry test to assert the shared prompt config. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

Phase 3 starts diversifying metric types beyond math. Neither benchmark fits the `math_with_judge` helper (agent names use `_mcqa_` / `_benchmark_` infix, not `_math_with_judge_`), so they're registered as standalone GymBenchmarkConfig entries. - gpqa → resources_servers/mcqa with `lenient_answer_colon_md` grading; data file is `gpqa_diamond_benchmark.jsonl` (Gym ships the diamond split as the default). - ifbench → resources_servers/ifbench; agent name has `_benchmark_` infix per the Gym yaml convention. livecodebench deferred — has v5/v6 split, needs sandbox, and uses the code_gen resource server. Separate cycle. Tests: per-benchmark registry checks asserting agent_name / input file / prompt path. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

…ompt path - livecodebench: Skills' `EVAL_SPLIT="test_v6_2408_2505"` maps to Gym's `benchmarks/livecodebench/v6_2408_2505/` config. Code execution is handled inside Gym's code_gen resource server (Ray subprocess pool), so this entry doesn't set Skills' `--with-sandbox`. - gpqa: upstream Gym moved `benchmarks/<bench>/prompts/default.yaml` to shared `benchmarks/prompts/eval/aai/mcq-4choices.yaml`. Update the registry entry + test to point at the current path. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

Replace the hand-rolled 8-entry dict with the full set of 57 benchmarks whose Gym `benchmarks/<name>/config.yaml` maps to a Skills dataset module. Entries are generated by `tools/discover_registry.py` in the recipe root. Regenerate after any upstream Gym pull. Most are registered but unvalidated — adding the entry just unblocks `--backend=gym` to try them. Each will surface its own quirks (metric naming, container deps, schema mismatches) when actually run. Known already-validated set: gsm8k, aime24/25, hmmt_feb25, hendrycks_math (math), ifbench (IF), gpqa (mcqa). livecodebench has the registry entry but live-validation is still pending. Also relaxed `GymBenchmarkConfig.prompt_config` to `Optional[str]` since some Gym benchmarks (aalcr, mrcr) ship without one — input rows already carry `responses_create_params.input`. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

gwarmstrong added 4 commits May 18, 2026 11:59

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

gwarmstrong added 2 commits May 18, 2026 12:56

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

gwarmstrong marked this pull request as draft May 18, 2026 20:48

gwarmstrong added 9 commits May 18, 2026 14:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): --backend=gym pilot for ns eval (gsm8k)#1454

feat(eval): --backend=gym pilot for ns eval (gsm8k)#1454
gwarmstrong wants to merge 17 commits into
mainfrom
georgea/convert-eval-to-gym

gwarmstrong commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gwarmstrong commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in here

Test plan

Things deferred

Risk

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gwarmstrong commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading