Skip to content

feat(eval): --backend=gym pilot for ns eval (gsm8k)#1454

Draft
gwarmstrong wants to merge 17 commits into
mainfrom
georgea/convert-eval-to-gym
Draft

feat(eval): --backend=gym pilot for ns eval (gsm8k)#1454
gwarmstrong wants to merge 17 commits into
mainfrom
georgea/convert-eval-to-gym

Conversation

@gwarmstrong
Copy link
Copy Markdown
Collaborator

@gwarmstrong gwarmstrong commented May 18, 2026

Summary

Scaffolds a Gym generation backend for ns eval, gated on an opt-in --backend={skills,gym} flag (default: skills, no behavior change). The gym path runs ng_run + ng_collect_rollouts against NeMo-Gym, converts the resulting rollouts into a Skills-shape output.jsonl, and lets the existing summarize_results step produce metrics.json unchanged.

Builds on PR #1222 (EvalGenerationUnit / EvalClientScript). Reuses ServerScript / SandboxScript / hostname-ref resolution and NemoGymRolloutsScript's install-and-wait shell scaffolding.

Pilot scope is gsm8k only — other benchmarks fall through to the existing Skills path. A tiny per-benchmark registry (nemo_skills/pipeline/utils/gym/registry.py) maps a benchmark name to (config_paths, agent_name); once Gym's auto-resolver PR lands, this is the obvious replacement point.

What's in here

Three reviewable commits:

  1. feat(eval): add --backend={skills,gym} flag and Skills→Gym override translator

    • EvalBackend enum + --backend typer flag on ns eval.
    • nemo_skills/pipeline/utils/gym/translator.py: pure function that rewrites Skills ++ Hydra overrides into Gym + overrides for ng_collect_rollouts. Handles direct renames, extra_body merging (with dict-literal syntax to dodge the extra_body: null Hydra-override gotcha), and an explicit drop-list for fields the migration agreed not to carry over.
  2. feat(eval): wire --backend=gym to a GymEvalClientScript pilot (gsm8k only)

    • nemo_skills/pipeline/utils/scripts/gym_eval.py: GymEvalClientScript, peer of EvalClientScript. Builds a single shell command: resolve Gym path → activate venv → wait for policy server → start ng_run → poll ng_status → run one ng_collect_rollouts per unit → clean up.
    • nemo_skills/pipeline/utils/gym/registry.py: temporary benchmark registry (gsm8k only).
    • Dispatcher in pipeline/eval.py: registered-benchmark pre-check, one-benchmark-per-job guard, multi-model not supported.
  3. feat(eval): gym→skills metrics adapter (math case)

    • nemo_skills/pipeline/adapters/gym_to_skills.py: pure converter + file-level wrapper + argparse CLI. Maps math_with_judge rollouts to {generation, predicted_answer, expected_answer, symbolic_correct} (plus an opt-in judgement placeholder for later judge benchmarks).
    • Per-unit shell step appends the converter so summarize_results produces metrics.json with no changes.

Test plan

  • Translator: 43 unit tests (renames, extra_body merging, silent-drops, server/sandbox drops, rejected overrides, unknown-passthrough vs strict mode, realistic gsm8k combo).
  • GymEvalClientScript: 15 unit tests (registry, validation, shell-emit shape, per-seed / per-chunk output paths, translator pipeline, server hostname ref, sandbox env, multi-unit single-ng_run sharing, converter wiring).
  • Adapter: 15 unit tests (mapping, fallback when library_reward absent, multi-chunk output-text concat, judge surfacing, LaTeX pass-through, file IO, blank-line handling, dst-dir creation, CLI exit codes).
  • ns eval sentinel tests: backend flag wired; gym path rejects unregistered benchmarks; mixed registered+unregistered rejected.
  • No regressions in pre-existing test_eval.py.
  • End-to-end parity run (gsm8k Skills vs Gym, single seed → full benchmark 4-seed mean@4). Not in this PR — needs a cluster slot. This is the next deliverable.

Things deferred

  • Benchmark resolver: the temporary registry is a placeholder until the upstream Gym auto-resolver lands (a single-benchmark gsm8k entry today).
  • ns generate migration: separate follow-up; same dispatcher pattern.
  • Multi-model and chunking on the Gym path: raise NotImplementedError; Skills backend covers both.
  • Output-schema migration: explicitly NOT in this PR; the adapter keeps the existing summarize_results consumers working unchanged.

Risk

--backend=skills is the default and is untouched. The new code paths are gated behind --backend=gym and refuse to start for any benchmark not in the registry.

Summary by CodeRabbit

  • New Features

    • Choose evaluation backend via CLI (skills vs gym) with benchmark validation and container selection.
    • Gym evaluation client that installs, starts servers, collects per-unit rollouts, and can convert rollouts to Skills format.
    • CLI tool to convert Gym rollouts into Skills-compatible JSONL (math metric supported).
    • Translator that maps Skills-style CLI overrides into Gym-compatible flags and payloads.
    • Exposed Gym registry/config utilities for pipeline use.
  • Tests

    • Expanded test coverage for backend behavior, registry, translator, adapter, and Gym client script generation.

Review Change Stack

…ranslator

Scaffolds the path for routing `ns eval` through NeMo-Gym. No behavior change
when --backend=skills (the default); --backend=gym raises NotImplementedError
until GymEvalClientScript is wired up.

- Adds EvalBackend enum + --backend typer flag on ns eval.
- Adds nemo_skills/pipeline/utils/gym/translator.py: a pure function that
  rewrites Skills `++` Hydra overrides into Gym `+` overrides for
  ng_collect_rollouts. Includes:
    * Direct renames (++inference.temperature → +responses_create_params.*)
    * extra_body merging for vLLM-only knobs (top_k, min_p, repetition_penalty,
      random_seed→seed), using dict-literal syntax to avoid the
      `extra_body: null` Hydra override gotcha.
    * Explicit drop list sourced from the migration's manager-review decisions.
    * Strict-mode option for unknown overrides; default is pass-through + warn
      so partial migration doesn't silently lose flags.

Tests: 43 unit tests for the translator covering renames, extra_body merging,
silent drops, server/sandbox drops, rejected overrides, unknown-passthrough
vs strict mode, and a realistic gsm8k combo. Two sentinel tests on ns eval
itself confirm the flag is wired and gym path raises until step 3 lands.

Migration plan: see convert-eval-to-gym/TIERED_PLAN.md (Tier 1 steps 2 & 4).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…only)

Builds the Gym client script and dispatches to it from `ns eval --backend=gym`.
The previous commit added the flag + a NotImplementedError gate; this commit
replaces that gate with real wiring.

- nemo_skills/pipeline/utils/gym/registry.py: tiny benchmark → (config_paths,
  agent_name) registry. Only gsm8k registered for v1. Designed to be replaced
  by the upstream Gym auto-resolver PR once that lands.

- nemo_skills/pipeline/utils/scripts/gym_eval.py: GymEvalClientScript, a peer
  of EvalClientScript. Surface matches (units, servers, sandbox, model_names,
  server_types, server_addresses_prehosted) plus Gym-specific fields
  (config_paths, agent_name, gym_path, policy_api_key, policy_model_name).
  Builds a single shell command:
    1. Resolve Gym install path (importable nemo_gym → container fallback).
    2. `uv sync --active --extra dev` to ensure venv exists.
    3. Wait for the policy server.
    4. Start ng_run with config_paths in background; poll ng_status until
       healthy.
    5. Per unit: ng_collect_rollouts with the translator-rewritten
       extra_arguments; output file name carries -rs{seed}/-chunk{id} suffixes.
    6. Clean up ng_run.
  Reuses NemoGymRolloutsScript's shell scaffolding; sandbox host/port are
  forwarded via env vars (Q5 decision: Skills keeps owning the sandbox).

- Dispatcher in ns eval:
  * Early pre-check: --backend=gym fails fast for benchmarks not in the
    registry, with a clear pointer to --backend=skills as the fallback.
  * One-benchmark-per-job guard: a single SLURM job runs one shared ng_run
    mesh, so we refuse to mix benchmarks (and tell the user how to fix it).
  * Multi-model unsupported on the Gym path (Skills v1 keeps the multi-model
    eval path; raises NotImplementedError if requested).

Tests: 12 new GymEvalClientScript tests (registry, validation, shell-emit
shape, per-seed/per-chunk output paths, translator pipeline, server hostname
ref, sandbox env, multi-unit single-ng_run sharing). Two ns eval sentinel
tests replace the prior NotImplementedError check with a "rejects unregistered
benchmark" check.

Tier 1 step 3 (TIERED_PLAN.md) complete. Remaining: step 5 (metrics adapter)
and step 6 (parity run on a cluster).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Wraps Tier 1 by closing the metrics gap: after each `ng_collect_rollouts`,
convert Gym's `rollouts.jsonl` into a Skills-shape `output.jsonl` so the
existing `summarize_results` task produces the right `metrics.json` without
changes. The dispatcher passes the benchmark's `METRICS_TYPE` through to the
client script; only "math" is implemented for v1 (gsm8k pilot).

- nemo_skills/pipeline/adapters/gym_to_skills.py: pure converter +
  file-level wrapper + argparse CLI suitable for shell invocation. Maps
  math_with_judge rollouts to {generation, predicted_answer, expected_answer,
  symbolic_correct} (plus an opt-in `judgement` placeholder for Tier 2 judge
  benchmarks). Unsupported metric_types raise NotImplementedError with a
  clear pointer.

- GymEvalClientScript: new optional `metric_type` field. When set, each
  per-unit shell block runs the converter immediately after
  ng_collect_rollouts succeeds, writing the Skills-shape output.jsonl next
  to the rollouts.jsonl (with matching -rs{seed}/-chunk{id} suffixes).
  When unset, the post-rollout step is skipped — keeps the script useful
  for non-metric flows.

- ns eval dispatcher: passes `benchmarks_dict[benchmark].metrics_type`
  (or the CLI `--metric_type` override) through to GymEvalClientScript.

Tests: 15 new converter tests (correct/incorrect mapping, missing
library_reward fallback, multi-chunk output_text concat, judge surfacing,
LaTeX expected-answer pass-through, file IO, blank-line handling,
dst-dir creation, CLI exit codes, unknown-type rejection). Three new
GymEvalClientScript tests verify the converter is wired in (only when
metric_type is set; produces -rs{seed}-suffixed Skills outputs; one call
per unit).

Tier 1 step 5 (TIERED_PLAN.md) complete. Remaining: step 6 (parity
run gsm8k Skills vs Gym).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
The Gym client task runs `ng_run` + `ng_collect_rollouts`, which live in
the gym container — the `nemo-skills` container doesn't have them. Without
this the parity run on real clusters would fail at first invocation.

Same container precedence as `nemo_gym_rollouts`: explicit `--main_container`
beats everything; otherwise `containers['nemo-gym']` if present, else
`containers['nemo-rl']`. `--backend=skills` is unchanged.

Tests:
- gym backend with only nemo-rl falls back to it.
- gym backend with nemo-gym key prefers it.
- skills backend still picks nemo-skills (regression guard).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Gym backend support: benchmark registry and translator, GymEvalClientScript orchestration, Gym→Skills rollout converter and CLI, eval CLI backend selection, and comprehensive tests covering registry, translator, converter, script output, and eval wiring.

Changes

Gym Backend Integration

Layer / File(s) Summary
Gym→Skills rollout converter
nemo_skills/pipeline/adapters/*, tests/test_gym_to_skills_adapter.py
New adapter module converts Gym rollouts.jsonl to Skills-shaped output.jsonl (math metric), includes streaming file conversion, CLI, and unit/file/CLI tests.
Gym utilities, registry, and translator
nemo_skills/pipeline/utils/gym/*, tests/test_gym_translator.py
Adds gym utils package entrypoint, immutable GymBenchmarkConfig registry and lookup helpers, and translate_skills_overrides_to_gym with rejection/drop rules and tests.
Gym eval orchestration script
nemo_skills/pipeline/utils/scripts/gym_eval.py, nemo_skills/pipeline/utils/scripts/__init__.py, tests/test_gym_eval_script.py
Adds GymEvalClientScript building an inline shell scaffold to run ng_run/ng_collect_rollouts, translate overrides, optionally convert rollouts, and includes many tests validating emitted shell fragments and behavior.
Eval CLI/backend wiring
nemo_skills/pipeline/eval.py, tests/test_eval.py
Adds EvalBackend enum and --backend option, validates requested gym benchmarks, constructs GymEvalClientScript for gym backend (single benchmark per job), and adjusts client container selection; tests for backend behavior and container selection.

Sequence Diagram(s)

sequenceDiagram
  participant Shell
  participant ng_run
  participant ng_status
  participant ng_collect
  participant Converter
  Shell->>ng_run: start background (config_paths, policy params)
  Shell->>ng_status: poll until healthy
  loop units
    Shell->>ng_collect: run with agent_name, input/output, translated overrides
    ng_collect-->>Shell: rollouts.jsonl
    alt metric_type set
      Shell->>Converter: python -m nemo_skills.pipeline.adapters.gym_to_skills --metric_type=...
      Converter-->>Shell: skills-shaped output.jsonl
    end
  end
  Shell->>ng_run: terminate
Loading

Estimated code review effort:
🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers:

  • AlexGrinch
  • naymaraq
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.62% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically summarizes the main change: adding a Gym backend option for the eval command, with the pilot scope clearly limited to gsm8k via the parenthetical.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch georgea/convert-eval-to-gym

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/test_gym_to_skills_adapter.py (1)

63-123: ⚡ Quick win

Add contract tests for missing required rollout keys

Please add negative tests asserting KeyError for missing required keys (response, extracted_answer, expected_answer, and reward when library_reward is absent`) so schema drift fails loudly.

Suggested tests
 class TestConvertMathRollout:
+    def test_missing_expected_answer_raises(self):
+        rollout = _gym_rollout()
+        del rollout["expected_answer"]
+        with pytest.raises(KeyError):
+            convert_math_rollout(rollout)
+
+    def test_missing_response_raises(self):
+        rollout = _gym_rollout()
+        del rollout["response"]
+        with pytest.raises(KeyError):
+            convert_math_rollout(rollout)

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_gym_to_skills_adapter.py` around lines 63 - 123, Add negative
contract tests to tests/test_gym_to_skills_adapter.py that call
convert_math_rollout with rollouts missing required keys and assert they raise
KeyError: create variants from _gym_rollout then del rollout["response"], del
rollout["extracted_answer"], del rollout["expected_answer"], and a variant where
you remove "library_reward" and also del rollout["reward"] to verify missing
reward raises KeyError; wrap each call in pytest.raises(KeyError) so
convert_math_rollout fails loudly rather than using .get(), referencing
convert_math_rollout and _gym_rollout to locate the test targets.
tests/test_gym_translator.py (1)

202-208: ⚡ Quick win

Add an assertion for quoted passthrough token integrity.

This test currently misses the critical expectation that ++some_string="hello world" remains a single effective argument after translation.

Suggested assertion update
@@
     def test_quoted_values_survive(self):
@@
         result = translate_skills_overrides_to_gym('++inference.temperature=0.7 ++some_string="hello world"')
@@
         assert "+responses_create_params.temperature=0.7" in result
+        assert "some_string" in result
+        assert "hello world" in result
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_gym_translator.py` around lines 202 - 208, The test
test_quoted_values_survive is missing an assertion that the quoted token stays a
single argument; update the test to assert that
translate_skills_overrides_to_gym preserves the quoted passthrough as a single
token by checking for the expected output fragment corresponding to
++some_string="hello world" (use the translate_skills_overrides_to_gym call and
assert the resulting string contains the single-token form for that
passthrough), referencing the test function name and
translate_skills_overrides_to_gym to locate and modify the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/pipeline/adapters/gym_to_skills.py`:
- Around line 45-54: The converter is silently tolerating missing rollout-schema
keys by using dict.get, so update the code to fail fast: in _extract_output_text
and the related extractor around lines referenced (handling response,
extracted_answer, expected_answer, reward fallback), replace uses of
response.get(...) and similar .get() calls with direct key access (e.g.,
response["output"], item["content"], c["type"], c["text"]) and add explicit,
descriptive exceptions (KeyError or ValueError) when required keys are absent so
schema drift surfaces immediately; ensure any fallback logic for reward is
removed or replaced with a clear error and include the symbol names
(_extract_output_text and the other extractor functions handling
extracted_answer/expected_answer) in your changes.
- Around line 133-151: The convert_file function currently opens output_path for
writing before conversion, risking partial/truncated output on errors; change
convert_file so it first fully reads and converts all rollouts by iterating
convert_rollouts(rollouts, metric_type=metric_type) into an in-memory list (or
stream to a temp file in the same directory), only then open output_path for
writing (or atomically replace it by writing to a temp file and os.replace into
output_path). Ensure you preserve the returned row count n and the LOG.info
message, and reference convert_rollouts, rollouts_path, output_path, and n when
locating where to implement the two-phase convert-then-write behavior.

In `@nemo_skills/pipeline/utils/gym/translator.py`:
- Around line 139-197: The returned command string is built with "
".join(passthrough_parts + gym_parts) which loses shell quoting and splits
tokens containing spaces; replace that final join with
shlex.join(passthrough_parts + gym_parts) (ensure shlex is imported/available)
so tokens like values merged into extra_body remain properly shell-quoted;
update the return expression in the function that builds
passthrough_parts/gym_parts/extra_body (the block that currently ends with
return " ".join(...).strip()) to use shlex.join(...).strip() instead.

---

Nitpick comments:
In `@tests/test_gym_to_skills_adapter.py`:
- Around line 63-123: Add negative contract tests to
tests/test_gym_to_skills_adapter.py that call convert_math_rollout with rollouts
missing required keys and assert they raise KeyError: create variants from
_gym_rollout then del rollout["response"], del rollout["extracted_answer"], del
rollout["expected_answer"], and a variant where you remove "library_reward" and
also del rollout["reward"] to verify missing reward raises KeyError; wrap each
call in pytest.raises(KeyError) so convert_math_rollout fails loudly rather than
using .get(), referencing convert_math_rollout and _gym_rollout to locate the
test targets.

In `@tests/test_gym_translator.py`:
- Around line 202-208: The test test_quoted_values_survive is missing an
assertion that the quoted token stays a single argument; update the test to
assert that translate_skills_overrides_to_gym preserves the quoted passthrough
as a single token by checking for the expected output fragment corresponding to
++some_string="hello world" (use the translate_skills_overrides_to_gym call and
assert the resulting string contains the single-token form for that
passthrough), referencing the test function name and
translate_skills_overrides_to_gym to locate and modify the assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a92f9211-53e0-4a58-9768-930c74ebd2d7

📥 Commits

Reviewing files that changed from the base of the PR and between 467b057 and 4bd1542.

📒 Files selected for processing (12)
  • nemo_skills/pipeline/adapters/__init__.py
  • nemo_skills/pipeline/adapters/gym_to_skills.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/gym/__init__.py
  • nemo_skills/pipeline/utils/gym/registry.py
  • nemo_skills/pipeline/utils/gym/translator.py
  • nemo_skills/pipeline/utils/scripts/__init__.py
  • nemo_skills/pipeline/utils/scripts/gym_eval.py
  • tests/test_eval.py
  • tests/test_gym_eval_script.py
  • tests/test_gym_to_skills_adapter.py
  • tests/test_gym_translator.py

Comment thread nemo_skills/adapters/gym_to_skills.py Outdated
Comment on lines +45 to +54
def _extract_output_text(response: Dict[str, Any]) -> str:
"""Pull the assistant-message text out of an OpenAI Responses-API response."""
chunks: List[str] = []
for item in response.get("output", []) or []:
for c in item.get("content", []) or []:
if c.get("type") == "output_text":
text = c.get("text")
if text:
chunks.append(text)
return "".join(chunks)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on required rollout schema keys

This converter silently degrades required fields (response, extracted_answer, expected_answer, and reward fallback), which can emit corrupted Skills rows instead of surfacing schema drift immediately.

Proposed fix
 def _extract_output_text(response: Dict[str, Any]) -> str:
     """Pull the assistant-message text out of an OpenAI Responses-API response."""
     chunks: List[str] = []
-    for item in response.get("output", []) or []:
-        for c in item.get("content", []) or []:
-            if c.get("type") == "output_text":
-                text = c.get("text")
+    for item in response["output"]:
+        for c in item["content"]:
+            if c["type"] == "output_text":
+                text = c["text"]
                 if text:
                     chunks.append(text)
     return "".join(chunks)
@@
-    library_reward = rollout.get("library_reward")
-    if library_reward is None:
+    library_reward = rollout["library_reward"] if "library_reward" in rollout else rollout["reward"]
+    if library_reward is None:
         # Fall back to the overall reward when the server doesn't expose the
         # symbolic-only signal (we lose the judge/symbolic separation but at
         # least we get a pass/fail).
-        library_reward = rollout.get("reward", 0.0)
+        library_reward = rollout["reward"]
@@
     out: Dict[str, Any] = {
-        "generation": _extract_output_text(rollout.get("response", {}) or {}),
-        "predicted_answer": rollout.get("extracted_answer"),
-        "expected_answer": rollout.get("expected_answer"),
+        "generation": _extract_output_text(rollout["response"]),
+        "predicted_answer": rollout["extracted_answer"],
+        "expected_answer": rollout["expected_answer"],
         "symbolic_correct": bool(library_reward == 1.0),
     }

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 70-81

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/adapters/gym_to_skills.py` around lines 45 - 54, The
converter is silently tolerating missing rollout-schema keys by using dict.get,
so update the code to fail fast: in _extract_output_text and the related
extractor around lines referenced (handling response, extracted_answer,
expected_answer, reward fallback), replace uses of response.get(...) and similar
.get() calls with direct key access (e.g., response["output"], item["content"],
c["type"], c["text"]) and add explicit, descriptive exceptions (KeyError or
ValueError) when required keys are absent so schema drift surfaces immediately;
ensure any fallback logic for reward is removed or replaced with a clear error
and include the symbol names (_extract_output_text and the other extractor
functions handling extracted_answer/expected_answer) in your changes.

Comment thread nemo_skills/adapters/gym_to_skills.py Outdated
Comment on lines +133 to +151
def convert_file(
rollouts_path: str | Path,
output_path: str | Path,
*,
metric_type: str,
) -> int:
"""Read `rollouts_path`, write Skills-shape JSONL to `output_path`. Returns row count."""
rollouts_path = Path(rollouts_path)
output_path = Path(output_path)
output_path.parent.mkdir(parents=True, exist_ok=True)

n = 0
with rollouts_path.open("r", encoding="utf-8") as fin, output_path.open("w", encoding="utf-8") as fout:
rollouts = (json.loads(line) for line in fin if line.strip())
for prediction in convert_rollouts(rollouts, metric_type=metric_type):
fout.write(json.dumps(prediction))
fout.write("\n")
n += 1
LOG.info("Wrote %d converted rows from %s to %s", n, rollouts_path, output_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Protect output files from partial writes on conversion failures

convert_file truncates/writes output_path while parsing/converting is still running. If a malformed JSON line or schema error occurs, callers can be left with partial output.

Proposed fix
+import tempfile
@@
 def convert_file(
@@
 ) -> int:
@@
-    n = 0
-    with rollouts_path.open("r", encoding="utf-8") as fin, output_path.open("w", encoding="utf-8") as fout:
-        rollouts = (json.loads(line) for line in fin if line.strip())
-        for prediction in convert_rollouts(rollouts, metric_type=metric_type):
-            fout.write(json.dumps(prediction))
-            fout.write("\n")
-            n += 1
+    n = 0
+    with rollouts_path.open("r", encoding="utf-8") as fin:
+        rollouts = (json.loads(line) for line in fin if line.strip())
+        converted = list(convert_rollouts(rollouts, metric_type=metric_type))
+
+    with tempfile.NamedTemporaryFile("w", encoding="utf-8", dir=output_path.parent, delete=False) as tmp:
+        tmp_path = Path(tmp.name)
+        for prediction in converted:
+            tmp.write(json.dumps(prediction))
+            tmp.write("\n")
+            n += 1
+
+    tmp_path.replace(output_path)
     LOG.info("Wrote %d converted rows from %s to %s", n, rollouts_path, output_path)
     return n

As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/adapters/gym_to_skills.py` around lines 133 - 151, The
convert_file function currently opens output_path for writing before conversion,
risking partial/truncated output on errors; change convert_file so it first
fully reads and converts all rollouts by iterating convert_rollouts(rollouts,
metric_type=metric_type) into an in-memory list (or stream to a temp file in the
same directory), only then open output_path for writing (or atomically replace
it by writing to a temp file and os.replace into output_path). Ensure you
preserve the returned row count n and the LOG.info message, and reference
convert_rollouts, rollouts_path, output_path, and n when locating where to
implement the two-phase convert-then-write behavior.

Comment on lines +139 to +197
for token in shlex.split(extra_arguments):
key, value, prefix = _parse_token(token)

# Non-override token (e.g. `--config-name foo`): pass through unchanged.
if prefix is None:
passthrough_parts.append(token)
continue

if key in _REJECTED:
raise UnsupportedSkillsOverrideError(f"{key}: {_REJECTED[key]}")

if key in _SILENTLY_DROPPED:
LOG.debug("Dropping Skills override %s (%s)", key, _SILENTLY_DROPPED[key])
continue

if any(key.startswith(p) for p in _DROPPED_PREFIXES):
matched_prefix = next(p for p in _DROPPED_PREFIXES if key.startswith(p))
LOG.debug("Dropping Skills override %s (%s)", key, _DROPPED_PREFIXES[matched_prefix])
continue

if key in _EXTRA_BODY_KEYS:
extra_body[_EXTRA_BODY_KEYS[key]] = value
continue

if key in _DIRECT_RENAMES:
gym_parts.append(f"+{_DIRECT_RENAMES[key]}={value}")
continue

# `++inference.extra_body.foo=bar` → merge into extra_body dict.
if key.startswith("inference.extra_body."):
sub_key = key[len("inference.extra_body.") :]
extra_body[sub_key] = value
continue

# Server / sandbox overrides are handled by the pipeline layer
# (model-server YAML assembly), not by ng_collect_rollouts. Drop them
# here so they don't end up as bogus Gym flags.
if key.startswith("server.") or key.startswith("sandbox."):
LOG.debug("Dropping Skills override %s (handled by pipeline layer)", key)
continue

if strict_unknown:
raise UnsupportedSkillsOverrideError(
f"No Gym mapping for Skills override `++{key}`. "
f"Add it to nemo_skills/pipeline/utils/gym/translator.py "
f"or use --backend=skills."
)
LOG.warning("Unmapped Skills override `++%s` passed through unchanged", key)
passthrough_parts.append(token)

if extra_body:
# Dict-literal syntax to dodge the `extra_body: null` Hydra gotcha
# (vllm_model.yaml ships extra_body: null, which silently no-ops
# `++…extra_body.foo=bar`). See memory/feedback_hydra_extra_body_override.md.
body_str = ", ".join(f"{k}: {v}" for k, v in sorted(extra_body.items()))
gym_parts.append(f"+responses_create_params.extra_body={{{body_str}}}")

return " ".join(passthrough_parts + gym_parts).strip()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and read the relevant section
cat -n nemo_skills/pipeline/utils/gym/translator.py | sed -n '130,205p'

Repository: NVIDIA-NeMo/Skills

Length of output: 3631


🏁 Script executed:

# Also check the function signature to understand what this is part of
cat -n nemo_skills/pipeline/utils/gym/translator.py | sed -n '120,205p'

Repository: NVIDIA-NeMo/Skills

Length of output: 4138


🏁 Script executed:

# Search for usages of this function's return value to understand how output is consumed
rg "translate_skills_args|_translate_overrides" nemo_skills/ -A 5 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 44


🏁 Script executed:

# Check the function signature and docstring to understand return value requirements
cat -n nemo_skills/pipeline/utils/gym/translator.py | sed -n '100,130p'

Repository: NVIDIA-NeMo/Skills

Length of output: 1454


🏁 Script executed:

# Check what Python version this project targets
find . -name "setup.py" -o -name "pyproject.toml" -o -name ".python-version" | head -5 | xargs cat 2>/dev/null | grep -A 5 -B 5 "python"

Repository: NVIDIA-NeMo/Skills

Length of output: 1627


🏁 Script executed:

# Search for actual usage of the translate_skills_args function in shell contexts
rg "translate_skills_args\(" nemo_skills/ -B 3 -A 3

Repository: NVIDIA-NeMo/Skills

Length of output: 44


🏁 Script executed:

# Search for actual usage and how the return value is consumed
rg "translate_skills_overrides_to_gym" --type py -B 2 -A 5

Repository: NVIDIA-NeMo/Skills

Length of output: 13733


🏁 Script executed:

# Check if shlex.join is used elsewhere in the codebase
rg "shlex\.join" nemo_skills/

Repository: NVIDIA-NeMo/Skills

Length of output: 395


🏁 Script executed:

# Look for subprocess calls or shell command construction that might use this function's output
rg "ng_collect_rollouts" --type py -B 5 -A 5

Repository: NVIDIA-NeMo/Skills

Length of output: 21763


🏁 Script executed:

# Verify the specific issue: check how gym_eval.py constructs the full command
cat -n nemo_skills/pipeline/utils/scripts/gym_eval.py | sed -n '245,290p'

Repository: NVIDIA-NeMo/Skills

Length of output: 2275


🏁 Script executed:

# Look at a concrete scenario: what happens with values containing spaces in extra_body
python3 <<'EOF'
import shlex

# Simulate the scenario
key = "foo"
value = "hello world"  # Value with space

# Current approach (what the code does)
gym_parts = []
extra_body = {}
extra_body["foo"] = "hello world"
body_str = ", ".join(f"{k}: {v}" for k, v in sorted(extra_body.items()))
gym_parts.append(f"+responses_create_params.extra_body={{{body_str}}}")

result_current = " ".join(gym_parts).strip()
print(f"Current result: {result_current}")
print(f"If re-parsed by shell, becomes tokens: {shlex.split(result_current)}")

# With shlex.join
result_proper = shlex.join(gym_parts).strip()
print(f"\nWith shlex.join: {result_proper}")
print(f"If re-parsed by shell, becomes tokens: {shlex.split(result_proper)}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 410


Preserve shell token boundaries when rebuilding translated args.

Line 196 uses plain string join after shlex.split(), which fails to quote tokens containing spaces. When values are merged into the extra_body dict (e.g., "hello world"), the resulting string gets split into multiple tokens when parsed as shell arguments rather than a single argument. The docstring promises a "shell-quoted string" but the implementation doesn't deliver it.

Use shlex.join() instead, which is available in Python 3.8+ (project requires >=3.10) and already used elsewhere in the codebase:

-    return " ".join(passthrough_parts + gym_parts).strip()
+    return shlex.join(passthrough_parts + gym_parts).strip()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/utils/gym/translator.py` around lines 139 - 197, The
returned command string is built with " ".join(passthrough_parts + gym_parts)
which loses shell quoting and splits tokens containing spaces; replace that
final join with shlex.join(passthrough_parts + gym_parts) (ensure shlex is
imported/available) so tokens like values merged into extra_body remain properly
shell-quoted; update the return expression in the function that builds
passthrough_parts/gym_parts/extra_body (the block that currently ends with
return " ".join(...).strip()) to use shlex.join(...).strip() instead.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/test_eval.py (1)

349-598: 💤 Low value

Consider extracting shared setup into a fixture.

The three container selection tests share substantial setup code (cluster_config, benchmark_args, unit creation, monkeypatches). A parameterized fixture or helper function could reduce duplication and make the test intent clearer. That said, the current implementation is readable and self-contained.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_eval.py` around lines 349 - 598, Extract the repeated setup in
test_gym_backend_picks_gym_container_falls_back_to_nemo_rl,
test_gym_backend_prefers_nemo_gym_container_when_present, and
test_skills_backend_keeps_using_nemo_skills_container into a shared pytest
fixture or helper function: move creation of cluster_config, benchmark_args,
unit, fake_command_init, and the repeated monkeypatch.setattr calls for
eval_pipeline.pipeline_utils (get_cluster_config, resolve_mount_paths,
get_env_variables, check_mounts, get_exp, run_exp), prepare_eval_commands, and
Pipeline.run into that fixture/helper and have each test call it (or parametrize
container mappings and backend); keep assertions (captured["client_container"])
and calls to eval_pipeline.eval in the individual tests and reference the same
symbols (cluster_config, benchmark_args, unit, fake_command_init,
prepare_eval_commands, eval_pipeline.eval) so behavior remains identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/test_eval.py`:
- Around line 349-598: Extract the repeated setup in
test_gym_backend_picks_gym_container_falls_back_to_nemo_rl,
test_gym_backend_prefers_nemo_gym_container_when_present, and
test_skills_backend_keeps_using_nemo_skills_container into a shared pytest
fixture or helper function: move creation of cluster_config, benchmark_args,
unit, fake_command_init, and the repeated monkeypatch.setattr calls for
eval_pipeline.pipeline_utils (get_cluster_config, resolve_mount_paths,
get_env_variables, check_mounts, get_exp, run_exp), prepare_eval_commands, and
Pipeline.run into that fixture/helper and have each test call it (or parametrize
container mappings and backend); keep assertions (captured["client_container"])
and calls to eval_pipeline.eval in the individual tests and reference the same
symbols (cluster_config, benchmark_args, unit, fake_command_init,
prepare_eval_commands, eval_pipeline.eval) so behavior remains identical.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bba08dfd-6a14-48fb-bd19-a13a4aabf801

📥 Commits

Reviewing files that changed from the base of the PR and between 4bd1542 and b2437b5.

📒 Files selected for processing (2)
  • nemo_skills/pipeline/eval.py
  • tests/test_eval.py

Skills and Gym ship per-benchmark input JSONLs with different schemas — for
gsm8k, Skills uses `problem` and Gym uses `question`, so feeding Skills'
test.jsonl into ng_collect_rollouts via the resource server's
`apply_prompt_to_row` produces empty prompts. The unit tests passed because
they never exercised the schema dependency.

- Add `input_jsonl_fpath` to GymBenchmarkConfig (required).
- New `gym_input_jsonl_fpath` field on GymEvalClientScript: when set, the
  ng_collect_rollouts input file is the resolved `$GYM_PATH/<rel-path>` at
  runtime instead of the unit's Skills-side input.
- Dispatcher in ns eval threads the registry value through.

Tests: override forces the Gym path; absence falls back to the unit's
input_file (kept for back-compat / non-benchmark uses).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Skills' `++prompt_config=generic/math` is a path under
nemo_skills/prompt/config/. The translator was passing it through as an
unknown override; Gym's `ng_collect_rollouts` accepts a `prompt_config`
field but interprets it as a literal file path under the Gym install dir,
so the smoke run crashed with `FileNotFoundError: /opt/Gym/generic/math`.

In Gym the prompt config lives in the agent yaml's `datasets[]` entry —
nothing should override it via CLI. Add prompt_config and prompt_template
(used by some judge benchmarks) to the silent-drop list.

Tests: parametrised drop coverage, plus the gsm8k realistic-combo
regression test now asserts no `prompt_config` leaks through.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
nemo_skills/pipeline/utils/gym/translator.py (1)

145-202: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use shlex.join to preserve argument boundaries.

Line 202 rebuilds argv with plain " ".join(...), which drops shell quoting and can corrupt overrides containing spaces.

Proposed fix
-    return " ".join(passthrough_parts + gym_parts).strip()
+    return shlex.join(passthrough_parts + gym_parts).strip()
#!/bin/bash
# Verify current implementation and reproduce token-boundary breakage.
set -euo pipefail

echo "1) Confirm return expression:"
nl -ba nemo_skills/pipeline/utils/gym/translator.py | sed -n '190,206p'

echo
echo "2) Demonstrate why plain join is lossy:"
python - <<'PY'
import shlex
parts = ["+responses_create_params.extra_body={note: hello world}", "--flag", "x"]
plain = " ".join(parts)
safe = shlex.join(parts)
print("plain :", plain)
print("split :", shlex.split(plain))
print("safe  :", safe)
print("split :", shlex.split(safe))
assert shlex.split(safe) == parts
PY
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/utils/gym/translator.py` around lines 145 - 202, The
return value is rebuilt with " ".join(passthrough_parts + gym_parts) which loses
shell quoting and can break tokens with spaces; change that final join to use
shlex.join(passthrough_parts + gym_parts) so argument boundaries are preserved
(and add/import shlex at top if not already imported); keep the same list order
(passthrough_parts then gym_parts) and ensure the existing variables
passthrough_parts and gym_parts are used unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@nemo_skills/pipeline/utils/gym/translator.py`:
- Around line 145-202: The return value is rebuilt with "
".join(passthrough_parts + gym_parts) which loses shell quoting and can break
tokens with spaces; change that final join to use shlex.join(passthrough_parts +
gym_parts) so argument boundaries are preserved (and add/import shlex at top if
not already imported); keep the same list order (passthrough_parts then
gym_parts) and ensure the existing variables passthrough_parts and gym_parts are
used unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b447759-28d9-480c-936f-eb997cf6986d

📥 Commits

Reviewing files that changed from the base of the PR and between 85d1b1c and 53a2530.

📒 Files selected for processing (2)
  • nemo_skills/pipeline/utils/gym/translator.py
  • tests/test_gym_translator.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_gym_translator.py

The Skills-derived input JSONL carries raw fields (question/expected_answer)
rather than pre-built responses_create_params, so ng_collect_rollouts has to
render the prompt itself. The Gym agent yaml lists the prompt path inside
`datasets[].prompt_config`, but the CLI doesn't read it from there — it has
to be passed as +prompt_config explicitly.

- Add `prompt_config` field (required) to GymBenchmarkConfig; populate it
  for gsm8k from the agent yaml's value.
- New optional `gym_prompt_config` on GymEvalClientScript: when set,
  injects `+prompt_config="$GYM_PATH"/<rel-path>` so the path resolves
  against the same Gym dir as everything else at runtime.
- Dispatcher in ns eval threads the registry value through.

Without this fix the rollout crashes with `KeyError: 'responses_create_params'`
during input preprocessing (rollout_collection.py:216).

Tests: injection when set; absent when not.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 570bec36-f234-4610-beef-debb514494bd

📥 Commits

Reviewing files that changed from the base of the PR and between 53a2530 and 57c9f71.

📒 Files selected for processing (4)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/gym/registry.py
  • nemo_skills/pipeline/utils/scripts/gym_eval.py
  • tests/test_gym_eval_script.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemo_skills/pipeline/utils/gym/registry.py
  • tests/test_gym_eval_script.py
  • nemo_skills/pipeline/utils/scripts/gym_eval.py

Comment on lines +619 to +650
if backend == EvalBackend.gym.value:
# Gym runs one shared `ng_run` mesh per SLURM job, so all units
# in a job must share the same Gym wiring. Enforce one benchmark
# per job rather than per-unit mesh-switching.
if len(job_benchmarks) > 1:
raise ValueError(
f"--backend=gym requires one benchmark per SLURM job; this job "
f"would mix {sorted(job_benchmarks)}. Re-run with --num_jobs=-1 "
f"(or large enough that each benchmark gets its own job)."
)
(gym_benchmark,) = job_benchmarks
gym_cfg: GymBenchmarkConfig = get_gym_config(gym_benchmark)
# Effective metric_type for the gym→skills converter: prefer
# the CLI override, else the benchmark's METRICS_TYPE module
# constant. Same precedence summarize_results uses.
gym_metric_type = metric_type or benchmarks_dict[gym_benchmark].metrics_type
client_script = GymEvalClientScript(
units=unit_dicts,
config_paths=list(gym_cfg.config_paths),
agent_name=gym_cfg.agent_name,
gym_input_jsonl_fpath=gym_cfg.input_jsonl_fpath,
gym_prompt_config=gym_cfg.prompt_config,
metric_type=gym_metric_type,
single_node_mode=single_node_mode,
with_sandbox=sandbox_enabled,
servers=server_scripts,
server_addresses_prehosted=server_addresses_list,
model_names=models_list,
server_types=server_types_list,
sandbox=sandbox_script,
installation_command=installation_command,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject unsupported Gym modes here.

Lines 619-650 only enforce “one benchmark per job”. --backend=gym still accepts multi-model inputs and chunked eval units, even though this path is scoped to a single model without chunking. Please fail fast before constructing GymEvalClientScript; otherwise those user flags are accepted and the failure moves downstream.

Suggested guard
             if backend == EvalBackend.gym.value:
+                if num_models != 1:
+                    raise NotImplementedError(
+                        "--backend=gym currently supports exactly one model per evaluation job."
+                    )
+                if any(unit["chunk_id"] is not None or (unit["num_chunks"] or 1) > 1 for unit in unit_dicts):
+                    raise NotImplementedError(
+                        "--backend=gym does not support --num_chunks/--chunk_ids yet."
+                    )
                 # Gym runs one shared `ng_run` mesh per SLURM job, so all units
                 # in a job must share the same Gym wiring. Enforce one benchmark
                 # per job rather than per-unit mesh-switching.

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

ng_collect_rollouts writes a sidecar `<stem>_materialized_inputs.jsonl`
next to its output. The parent directory has to exist before that, but
neither nemo_run nor the GymEvalClientScript was creating it, so the
smoke run hit:

    FileNotFoundError: [Errno 2] No such file or directory:
    '<output_dir>/rollouts_materialized_inputs.jsonl'

Mirrors NemoGymRolloutsScript which does the same mkdir.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
@gwarmstrong gwarmstrong marked this pull request as draft May 18, 2026 20:48
The post-rollout converter runs inside the Gym container, which only has
`nemo-skills-core` installed (no `nemo_run`). The adapter previously lived
under `nemo_skills.pipeline.adapters`; importing it triggered the pipeline
package's `nemo_run` guard:

    ImportError: nemo-skills-core is installed but pipeline functionality
    requires the full package.

Move to `nemo_skills.adapters` (top-level). The adapter has no pipeline
dependencies, so this is purely about being reachable from the core install.

Updates the `python -m` invocation in GymEvalClientScript and all test
imports / shell-emit assertions.

Surfaced by the smoke run: the rollout itself completed at 90% symbolic
accuracy on 10 problems; only the post-step conversion failed to start.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Hydra accepts `+responses_create_params.extra_body={seed: 0}` as a single
override, but the space inside the dict makes bash word-split it into two
args. Hydra then sees `+…extra_body={seed:` and `0}` separately and dies
with `no viable alternative at input '{seed:'`. Surfaced live on a 2-seed
parity run (seed=0/1 → extra_body emitted; the smoke didn't trip it
because no extra_body was generated when random_seed=None).

Wrap both call sites in double quotes:
- Translator (when min_p/top_k/repetition_penalty/random_seed produce
  extra_body, plus any explicit `++inference.extra_body.foo=bar`).
- GymEvalClientScript (when per-unit random_seed surfaces as
  `extra_body={seed: N}`).

Tests: existing extra_body assertions updated to expect the quoted form;
new regression test for the per-seed shell-quoting case.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…forbid'

I'd been routing Skills' vLLM-only knobs (top_k, min_p, repetition_penalty,
random_seed) and the per-job random_seed through
`+responses_create_params.extra_body={…}`. That's wrong: Gym's
`NeMoGymResponseCreateParamsNonStreaming` is `extra="forbid"` and has no
`extra_body` field on the per-call schema. The 2-seed parity run on the
full benchmark surfaced this — every rollout 422'd with:

    "type": "extra_forbidden",
    "loc": ["body", "responses_create_params", "extra_body"]

(The smoke didn't trip it because no extra_body was emitted when
random_seed=None and the recipe didn't include any vLLM-only knobs.)

The `extra_body` Hydra-override memory was about the *model-server* config
field (`vllm_model.extra_body`) which is static / per-server, not per-call.
The right home for these knobs is overriding the model-server config at
ng_run time, not threading per-call args through ng_collect_rollouts.

Changes:
- Translator: vLLM-only knobs become silent-with-warning drops. Document
  the proper home in the warning text. Same for ++inference.extra_body.*.
- GymEvalClientScript: stop injecting `extra_body={seed: N}` per unit. Per-seed
  reproducibility is a known gap — Skills' two seeds and Gym's two seeds
  will produce uncorrelated variance, but mean@N statistics still compare.

Tests updated accordingly + regression assertions that `extra_body`/`seed:`
never leak into the emitted shell command.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Scope correction: the migration intentionally breaks the Skills
`output.jsonl` / `metrics.json` contract. Gym's native `rollouts.jsonl`
+ `rollouts_aggregate_metrics.json` are the canonical outputs on the
`--backend=gym` path.

Removes:
- `nemo_skills/adapters/gym_to_skills.py` and its package (whole adapter).
- `tests/test_gym_to_skills_adapter.py` (18 tests).
- `metric_type` field, `_convert_cmd` method, and the
  `_skills_output_jsonl_for_unit` helper on `GymEvalClientScript`. The
  per-unit shell block is now just ng_collect_rollouts (no post-step).
- The dispatcher's `gym_metric_type` plumbing.

Changes:
- `ns eval` skips the `summarize_results` SLURM task on the gym path —
  Gym already writes its aggregate metrics next to the rollouts JSONL,
  and `summarize_results` expects a Skills schema we no longer produce.

Tests:
- Replace the converter-wiring tests with a single negative-regression
  ("no skills schema artifacts in the emitted shell") on
  GymEvalClientScript.
- New `test_eval.py::test_gym_backend_skips_summarize_results_task`
  asserts the dispatcher doesn't schedule the dependent summarize job
  for `--backend=gym`.

The 2.46pp pass@1 gap reported in the bring-up parity run was an artefact
of Skills' MathMetrics processing Gym's lenient `extracted_answer`. With
this change we no longer feed Gym data through Skills' MathMetrics, so
that gap doesn't apply.

83 tests pass.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
All four wrap `math_with_judge` with a per-benchmark agent of the same
shape as gsm8k. Factor `_math_with_judge_entry()` so each registration
is one line plus an optional prompt_config override (aime/hmmt ship their
own per-benchmark prompt YAML; gsm8k and hendrycks_math share the
shared `benchmarks/prompts/generic_math.yaml`).

Tests: parametrized check that each entry resolves to the right
config_paths / agent_name / input_jsonl_fpath / prompt_config — guards
against typos in the path layout.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Upstream Gym restructured `benchmarks/prompts/`: the flat
`generic_math.yaml`, `math_answer_judge.yaml`, etc. moved to
sub-directories (`generic/math.yaml`, `judge/math.yaml`, …). The
per-benchmark prompt YAMLs we'd been pointing aime24/aime25/hmmt_feb25
at (`benchmarks/<bench>/prompts/default.yaml`) were also removed —
every math_with_judge benchmark now uses the shared
`benchmarks/prompts/generic/math.yaml`.

Update all five registry entries to the new path, simplify the helper's
default to match, and tighten the parametrized registry test to assert
the shared prompt config.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Phase 3 starts diversifying metric types beyond math. Neither benchmark
fits the `math_with_judge` helper (agent names use `_mcqa_` / `_benchmark_`
infix, not `_math_with_judge_`), so they're registered as standalone
GymBenchmarkConfig entries.

- gpqa → resources_servers/mcqa with `lenient_answer_colon_md` grading;
  data file is `gpqa_diamond_benchmark.jsonl` (Gym ships the diamond
  split as the default).
- ifbench → resources_servers/ifbench; agent name has `_benchmark_` infix
  per the Gym yaml convention.

livecodebench deferred — has v5/v6 split, needs sandbox, and uses the
code_gen resource server. Separate cycle.

Tests: per-benchmark registry checks asserting agent_name / input file /
prompt path.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…ompt path

- livecodebench: Skills' `EVAL_SPLIT="test_v6_2408_2505"` maps to Gym's
  `benchmarks/livecodebench/v6_2408_2505/` config. Code execution is
  handled inside Gym's code_gen resource server (Ray subprocess pool),
  so this entry doesn't set Skills' `--with-sandbox`.
- gpqa: upstream Gym moved `benchmarks/<bench>/prompts/default.yaml` to
  shared `benchmarks/prompts/eval/aai/mcq-4choices.yaml`. Update the
  registry entry + test to point at the current path.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Replace the hand-rolled 8-entry dict with the full set of 57 benchmarks
whose Gym `benchmarks/<name>/config.yaml` maps to a Skills dataset module.
Entries are generated by `tools/discover_registry.py` in the recipe root.
Regenerate after any upstream Gym pull.

Most are registered but unvalidated — adding the entry just unblocks
`--backend=gym` to try them. Each will surface its own quirks (metric
naming, container deps, schema mismatches) when actually run. Known
already-validated set: gsm8k, aime24/25, hmmt_feb25, hendrycks_math
(math), ifbench (IF), gpqa (mcqa). livecodebench has the registry entry
but live-validation is still pending.

Also relaxed `GymBenchmarkConfig.prompt_config` to `Optional[str]` since
some Gym benchmarks (aalcr, mrcr) ship without one — input rows already
carry `responses_create_params.input`.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant