feat(eval): --backend=gym pilot for ns eval (gsm8k)#1454
Conversation
…ranslator
Scaffolds the path for routing `ns eval` through NeMo-Gym. No behavior change
when --backend=skills (the default); --backend=gym raises NotImplementedError
until GymEvalClientScript is wired up.
- Adds EvalBackend enum + --backend typer flag on ns eval.
- Adds nemo_skills/pipeline/utils/gym/translator.py: a pure function that
rewrites Skills `++` Hydra overrides into Gym `+` overrides for
ng_collect_rollouts. Includes:
* Direct renames (++inference.temperature → +responses_create_params.*)
* extra_body merging for vLLM-only knobs (top_k, min_p, repetition_penalty,
random_seed→seed), using dict-literal syntax to avoid the
`extra_body: null` Hydra override gotcha.
* Explicit drop list sourced from the migration's manager-review decisions.
* Strict-mode option for unknown overrides; default is pass-through + warn
so partial migration doesn't silently lose flags.
Tests: 43 unit tests for the translator covering renames, extra_body merging,
silent drops, server/sandbox drops, rejected overrides, unknown-passthrough
vs strict mode, and a realistic gsm8k combo. Two sentinel tests on ns eval
itself confirm the flag is wired and gym path raises until step 3 lands.
Migration plan: see convert-eval-to-gym/TIERED_PLAN.md (Tier 1 steps 2 & 4).
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…only)
Builds the Gym client script and dispatches to it from `ns eval --backend=gym`.
The previous commit added the flag + a NotImplementedError gate; this commit
replaces that gate with real wiring.
- nemo_skills/pipeline/utils/gym/registry.py: tiny benchmark → (config_paths,
agent_name) registry. Only gsm8k registered for v1. Designed to be replaced
by the upstream Gym auto-resolver PR once that lands.
- nemo_skills/pipeline/utils/scripts/gym_eval.py: GymEvalClientScript, a peer
of EvalClientScript. Surface matches (units, servers, sandbox, model_names,
server_types, server_addresses_prehosted) plus Gym-specific fields
(config_paths, agent_name, gym_path, policy_api_key, policy_model_name).
Builds a single shell command:
1. Resolve Gym install path (importable nemo_gym → container fallback).
2. `uv sync --active --extra dev` to ensure venv exists.
3. Wait for the policy server.
4. Start ng_run with config_paths in background; poll ng_status until
healthy.
5. Per unit: ng_collect_rollouts with the translator-rewritten
extra_arguments; output file name carries -rs{seed}/-chunk{id} suffixes.
6. Clean up ng_run.
Reuses NemoGymRolloutsScript's shell scaffolding; sandbox host/port are
forwarded via env vars (Q5 decision: Skills keeps owning the sandbox).
- Dispatcher in ns eval:
* Early pre-check: --backend=gym fails fast for benchmarks not in the
registry, with a clear pointer to --backend=skills as the fallback.
* One-benchmark-per-job guard: a single SLURM job runs one shared ng_run
mesh, so we refuse to mix benchmarks (and tell the user how to fix it).
* Multi-model unsupported on the Gym path (Skills v1 keeps the multi-model
eval path; raises NotImplementedError if requested).
Tests: 12 new GymEvalClientScript tests (registry, validation, shell-emit
shape, per-seed/per-chunk output paths, translator pipeline, server hostname
ref, sandbox env, multi-unit single-ng_run sharing). Two ns eval sentinel
tests replace the prior NotImplementedError check with a "rejects unregistered
benchmark" check.
Tier 1 step 3 (TIERED_PLAN.md) complete. Remaining: step 5 (metrics adapter)
and step 6 (parity run on a cluster).
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Wraps Tier 1 by closing the metrics gap: after each `ng_collect_rollouts`,
convert Gym's `rollouts.jsonl` into a Skills-shape `output.jsonl` so the
existing `summarize_results` task produces the right `metrics.json` without
changes. The dispatcher passes the benchmark's `METRICS_TYPE` through to the
client script; only "math" is implemented for v1 (gsm8k pilot).
- nemo_skills/pipeline/adapters/gym_to_skills.py: pure converter +
file-level wrapper + argparse CLI suitable for shell invocation. Maps
math_with_judge rollouts to {generation, predicted_answer, expected_answer,
symbolic_correct} (plus an opt-in `judgement` placeholder for Tier 2 judge
benchmarks). Unsupported metric_types raise NotImplementedError with a
clear pointer.
- GymEvalClientScript: new optional `metric_type` field. When set, each
per-unit shell block runs the converter immediately after
ng_collect_rollouts succeeds, writing the Skills-shape output.jsonl next
to the rollouts.jsonl (with matching -rs{seed}/-chunk{id} suffixes).
When unset, the post-rollout step is skipped — keeps the script useful
for non-metric flows.
- ns eval dispatcher: passes `benchmarks_dict[benchmark].metrics_type`
(or the CLI `--metric_type` override) through to GymEvalClientScript.
Tests: 15 new converter tests (correct/incorrect mapping, missing
library_reward fallback, multi-chunk output_text concat, judge surfacing,
LaTeX expected-answer pass-through, file IO, blank-line handling,
dst-dir creation, CLI exit codes, unknown-type rejection). Three new
GymEvalClientScript tests verify the converter is wired in (only when
metric_type is set; produces -rs{seed}-suffixed Skills outputs; one call
per unit).
Tier 1 step 5 (TIERED_PLAN.md) complete. Remaining: step 6 (parity
run gsm8k Skills vs Gym).
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
The Gym client task runs `ng_run` + `ng_collect_rollouts`, which live in the gym container — the `nemo-skills` container doesn't have them. Without this the parity run on real clusters would fail at first invocation. Same container precedence as `nemo_gym_rollouts`: explicit `--main_container` beats everything; otherwise `containers['nemo-gym']` if present, else `containers['nemo-rl']`. `--backend=skills` is unchanged. Tests: - gym backend with only nemo-rl falls back to it. - gym backend with nemo-gym key prefers it. - skills backend still picks nemo-skills (regression guard). Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds Gym backend support: benchmark registry and translator, GymEvalClientScript orchestration, Gym→Skills rollout converter and CLI, eval CLI backend selection, and comprehensive tests covering registry, translator, converter, script output, and eval wiring. ChangesGym Backend Integration
Sequence Diagram(s) sequenceDiagram
participant Shell
participant ng_run
participant ng_status
participant ng_collect
participant Converter
Shell->>ng_run: start background (config_paths, policy params)
Shell->>ng_status: poll until healthy
loop units
Shell->>ng_collect: run with agent_name, input/output, translated overrides
ng_collect-->>Shell: rollouts.jsonl
alt metric_type set
Shell->>Converter: python -m nemo_skills.pipeline.adapters.gym_to_skills --metric_type=...
Converter-->>Shell: skills-shaped output.jsonl
end
end
Shell->>ng_run: terminate
Estimated code review effort: Suggested reviewers:
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/test_gym_to_skills_adapter.py (1)
63-123: ⚡ Quick winAdd contract tests for missing required rollout keys
Please add negative tests asserting
KeyErrorfor missing required keys (response,extracted_answer,expected_answer, and reward whenlibrary_rewardis absent`) so schema drift fails loudly.Suggested tests
class TestConvertMathRollout: + def test_missing_expected_answer_raises(self): + rollout = _gym_rollout() + del rollout["expected_answer"] + with pytest.raises(KeyError): + convert_math_rollout(rollout) + + def test_missing_response_raises(self): + rollout = _gym_rollout() + del rollout["response"] + with pytest.raises(KeyError): + convert_math_rollout(rollout)As per coding guidelines, "Don't use
.get()for accessing dictionary keys if the code expects them to be present; use direct accessdata[key_name]to fail with a clear error instead of silently corrupting data".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_gym_to_skills_adapter.py` around lines 63 - 123, Add negative contract tests to tests/test_gym_to_skills_adapter.py that call convert_math_rollout with rollouts missing required keys and assert they raise KeyError: create variants from _gym_rollout then del rollout["response"], del rollout["extracted_answer"], del rollout["expected_answer"], and a variant where you remove "library_reward" and also del rollout["reward"] to verify missing reward raises KeyError; wrap each call in pytest.raises(KeyError) so convert_math_rollout fails loudly rather than using .get(), referencing convert_math_rollout and _gym_rollout to locate the test targets.tests/test_gym_translator.py (1)
202-208: ⚡ Quick winAdd an assertion for quoted passthrough token integrity.
This test currently misses the critical expectation that
++some_string="hello world"remains a single effective argument after translation.Suggested assertion update
@@ def test_quoted_values_survive(self): @@ result = translate_skills_overrides_to_gym('++inference.temperature=0.7 ++some_string="hello world"') @@ assert "+responses_create_params.temperature=0.7" in result + assert "some_string" in result + assert "hello world" in result🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_gym_translator.py` around lines 202 - 208, The test test_quoted_values_survive is missing an assertion that the quoted token stays a single argument; update the test to assert that translate_skills_overrides_to_gym preserves the quoted passthrough as a single token by checking for the expected output fragment corresponding to ++some_string="hello world" (use the translate_skills_overrides_to_gym call and assert the resulting string contains the single-token form for that passthrough), referencing the test function name and translate_skills_overrides_to_gym to locate and modify the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/pipeline/adapters/gym_to_skills.py`:
- Around line 45-54: The converter is silently tolerating missing rollout-schema
keys by using dict.get, so update the code to fail fast: in _extract_output_text
and the related extractor around lines referenced (handling response,
extracted_answer, expected_answer, reward fallback), replace uses of
response.get(...) and similar .get() calls with direct key access (e.g.,
response["output"], item["content"], c["type"], c["text"]) and add explicit,
descriptive exceptions (KeyError or ValueError) when required keys are absent so
schema drift surfaces immediately; ensure any fallback logic for reward is
removed or replaced with a clear error and include the symbol names
(_extract_output_text and the other extractor functions handling
extracted_answer/expected_answer) in your changes.
- Around line 133-151: The convert_file function currently opens output_path for
writing before conversion, risking partial/truncated output on errors; change
convert_file so it first fully reads and converts all rollouts by iterating
convert_rollouts(rollouts, metric_type=metric_type) into an in-memory list (or
stream to a temp file in the same directory), only then open output_path for
writing (or atomically replace it by writing to a temp file and os.replace into
output_path). Ensure you preserve the returned row count n and the LOG.info
message, and reference convert_rollouts, rollouts_path, output_path, and n when
locating where to implement the two-phase convert-then-write behavior.
In `@nemo_skills/pipeline/utils/gym/translator.py`:
- Around line 139-197: The returned command string is built with "
".join(passthrough_parts + gym_parts) which loses shell quoting and splits
tokens containing spaces; replace that final join with
shlex.join(passthrough_parts + gym_parts) (ensure shlex is imported/available)
so tokens like values merged into extra_body remain properly shell-quoted;
update the return expression in the function that builds
passthrough_parts/gym_parts/extra_body (the block that currently ends with
return " ".join(...).strip()) to use shlex.join(...).strip() instead.
---
Nitpick comments:
In `@tests/test_gym_to_skills_adapter.py`:
- Around line 63-123: Add negative contract tests to
tests/test_gym_to_skills_adapter.py that call convert_math_rollout with rollouts
missing required keys and assert they raise KeyError: create variants from
_gym_rollout then del rollout["response"], del rollout["extracted_answer"], del
rollout["expected_answer"], and a variant where you remove "library_reward" and
also del rollout["reward"] to verify missing reward raises KeyError; wrap each
call in pytest.raises(KeyError) so convert_math_rollout fails loudly rather than
using .get(), referencing convert_math_rollout and _gym_rollout to locate the
test targets.
In `@tests/test_gym_translator.py`:
- Around line 202-208: The test test_quoted_values_survive is missing an
assertion that the quoted token stays a single argument; update the test to
assert that translate_skills_overrides_to_gym preserves the quoted passthrough
as a single token by checking for the expected output fragment corresponding to
++some_string="hello world" (use the translate_skills_overrides_to_gym call and
assert the resulting string contains the single-token form for that
passthrough), referencing the test function name and
translate_skills_overrides_to_gym to locate and modify the assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a92f9211-53e0-4a58-9768-930c74ebd2d7
📒 Files selected for processing (12)
nemo_skills/pipeline/adapters/__init__.pynemo_skills/pipeline/adapters/gym_to_skills.pynemo_skills/pipeline/eval.pynemo_skills/pipeline/utils/gym/__init__.pynemo_skills/pipeline/utils/gym/registry.pynemo_skills/pipeline/utils/gym/translator.pynemo_skills/pipeline/utils/scripts/__init__.pynemo_skills/pipeline/utils/scripts/gym_eval.pytests/test_eval.pytests/test_gym_eval_script.pytests/test_gym_to_skills_adapter.pytests/test_gym_translator.py
| def _extract_output_text(response: Dict[str, Any]) -> str: | ||
| """Pull the assistant-message text out of an OpenAI Responses-API response.""" | ||
| chunks: List[str] = [] | ||
| for item in response.get("output", []) or []: | ||
| for c in item.get("content", []) or []: | ||
| if c.get("type") == "output_text": | ||
| text = c.get("text") | ||
| if text: | ||
| chunks.append(text) | ||
| return "".join(chunks) |
There was a problem hiding this comment.
Fail fast on required rollout schema keys
This converter silently degrades required fields (response, extracted_answer, expected_answer, and reward fallback), which can emit corrupted Skills rows instead of surfacing schema drift immediately.
Proposed fix
def _extract_output_text(response: Dict[str, Any]) -> str:
"""Pull the assistant-message text out of an OpenAI Responses-API response."""
chunks: List[str] = []
- for item in response.get("output", []) or []:
- for c in item.get("content", []) or []:
- if c.get("type") == "output_text":
- text = c.get("text")
+ for item in response["output"]:
+ for c in item["content"]:
+ if c["type"] == "output_text":
+ text = c["text"]
if text:
chunks.append(text)
return "".join(chunks)
@@
- library_reward = rollout.get("library_reward")
- if library_reward is None:
+ library_reward = rollout["library_reward"] if "library_reward" in rollout else rollout["reward"]
+ if library_reward is None:
# Fall back to the overall reward when the server doesn't expose the
# symbolic-only signal (we lose the judge/symbolic separation but at
# least we get a pass/fail).
- library_reward = rollout.get("reward", 0.0)
+ library_reward = rollout["reward"]
@@
out: Dict[str, Any] = {
- "generation": _extract_output_text(rollout.get("response", {}) or {}),
- "predicted_answer": rollout.get("extracted_answer"),
- "expected_answer": rollout.get("expected_answer"),
+ "generation": _extract_output_text(rollout["response"]),
+ "predicted_answer": rollout["extracted_answer"],
+ "expected_answer": rollout["expected_answer"],
"symbolic_correct": bool(library_reward == 1.0),
}As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
Also applies to: 70-81
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/pipeline/adapters/gym_to_skills.py` around lines 45 - 54, The
converter is silently tolerating missing rollout-schema keys by using dict.get,
so update the code to fail fast: in _extract_output_text and the related
extractor around lines referenced (handling response, extracted_answer,
expected_answer, reward fallback), replace uses of response.get(...) and similar
.get() calls with direct key access (e.g., response["output"], item["content"],
c["type"], c["text"]) and add explicit, descriptive exceptions (KeyError or
ValueError) when required keys are absent so schema drift surfaces immediately;
ensure any fallback logic for reward is removed or replaced with a clear error
and include the symbol names (_extract_output_text and the other extractor
functions handling extracted_answer/expected_answer) in your changes.
| def convert_file( | ||
| rollouts_path: str | Path, | ||
| output_path: str | Path, | ||
| *, | ||
| metric_type: str, | ||
| ) -> int: | ||
| """Read `rollouts_path`, write Skills-shape JSONL to `output_path`. Returns row count.""" | ||
| rollouts_path = Path(rollouts_path) | ||
| output_path = Path(output_path) | ||
| output_path.parent.mkdir(parents=True, exist_ok=True) | ||
|
|
||
| n = 0 | ||
| with rollouts_path.open("r", encoding="utf-8") as fin, output_path.open("w", encoding="utf-8") as fout: | ||
| rollouts = (json.loads(line) for line in fin if line.strip()) | ||
| for prediction in convert_rollouts(rollouts, metric_type=metric_type): | ||
| fout.write(json.dumps(prediction)) | ||
| fout.write("\n") | ||
| n += 1 | ||
| LOG.info("Wrote %d converted rows from %s to %s", n, rollouts_path, output_path) |
There was a problem hiding this comment.
Protect output files from partial writes on conversion failures
convert_file truncates/writes output_path while parsing/converting is still running. If a malformed JSON line or schema error occurs, callers can be left with partial output.
Proposed fix
+import tempfile
@@
def convert_file(
@@
) -> int:
@@
- n = 0
- with rollouts_path.open("r", encoding="utf-8") as fin, output_path.open("w", encoding="utf-8") as fout:
- rollouts = (json.loads(line) for line in fin if line.strip())
- for prediction in convert_rollouts(rollouts, metric_type=metric_type):
- fout.write(json.dumps(prediction))
- fout.write("\n")
- n += 1
+ n = 0
+ with rollouts_path.open("r", encoding="utf-8") as fin:
+ rollouts = (json.loads(line) for line in fin if line.strip())
+ converted = list(convert_rollouts(rollouts, metric_type=metric_type))
+
+ with tempfile.NamedTemporaryFile("w", encoding="utf-8", dir=output_path.parent, delete=False) as tmp:
+ tmp_path = Path(tmp.name)
+ for prediction in converted:
+ tmp.write(json.dumps(prediction))
+ tmp.write("\n")
+ n += 1
+
+ tmp_path.replace(output_path)
LOG.info("Wrote %d converted rows from %s to %s", n, rollouts_path, output_path)
return nAs per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/pipeline/adapters/gym_to_skills.py` around lines 133 - 151, The
convert_file function currently opens output_path for writing before conversion,
risking partial/truncated output on errors; change convert_file so it first
fully reads and converts all rollouts by iterating convert_rollouts(rollouts,
metric_type=metric_type) into an in-memory list (or stream to a temp file in the
same directory), only then open output_path for writing (or atomically replace
it by writing to a temp file and os.replace into output_path). Ensure you
preserve the returned row count n and the LOG.info message, and reference
convert_rollouts, rollouts_path, output_path, and n when locating where to
implement the two-phase convert-then-write behavior.
| for token in shlex.split(extra_arguments): | ||
| key, value, prefix = _parse_token(token) | ||
|
|
||
| # Non-override token (e.g. `--config-name foo`): pass through unchanged. | ||
| if prefix is None: | ||
| passthrough_parts.append(token) | ||
| continue | ||
|
|
||
| if key in _REJECTED: | ||
| raise UnsupportedSkillsOverrideError(f"{key}: {_REJECTED[key]}") | ||
|
|
||
| if key in _SILENTLY_DROPPED: | ||
| LOG.debug("Dropping Skills override %s (%s)", key, _SILENTLY_DROPPED[key]) | ||
| continue | ||
|
|
||
| if any(key.startswith(p) for p in _DROPPED_PREFIXES): | ||
| matched_prefix = next(p for p in _DROPPED_PREFIXES if key.startswith(p)) | ||
| LOG.debug("Dropping Skills override %s (%s)", key, _DROPPED_PREFIXES[matched_prefix]) | ||
| continue | ||
|
|
||
| if key in _EXTRA_BODY_KEYS: | ||
| extra_body[_EXTRA_BODY_KEYS[key]] = value | ||
| continue | ||
|
|
||
| if key in _DIRECT_RENAMES: | ||
| gym_parts.append(f"+{_DIRECT_RENAMES[key]}={value}") | ||
| continue | ||
|
|
||
| # `++inference.extra_body.foo=bar` → merge into extra_body dict. | ||
| if key.startswith("inference.extra_body."): | ||
| sub_key = key[len("inference.extra_body.") :] | ||
| extra_body[sub_key] = value | ||
| continue | ||
|
|
||
| # Server / sandbox overrides are handled by the pipeline layer | ||
| # (model-server YAML assembly), not by ng_collect_rollouts. Drop them | ||
| # here so they don't end up as bogus Gym flags. | ||
| if key.startswith("server.") or key.startswith("sandbox."): | ||
| LOG.debug("Dropping Skills override %s (handled by pipeline layer)", key) | ||
| continue | ||
|
|
||
| if strict_unknown: | ||
| raise UnsupportedSkillsOverrideError( | ||
| f"No Gym mapping for Skills override `++{key}`. " | ||
| f"Add it to nemo_skills/pipeline/utils/gym/translator.py " | ||
| f"or use --backend=skills." | ||
| ) | ||
| LOG.warning("Unmapped Skills override `++%s` passed through unchanged", key) | ||
| passthrough_parts.append(token) | ||
|
|
||
| if extra_body: | ||
| # Dict-literal syntax to dodge the `extra_body: null` Hydra gotcha | ||
| # (vllm_model.yaml ships extra_body: null, which silently no-ops | ||
| # `++…extra_body.foo=bar`). See memory/feedback_hydra_extra_body_override.md. | ||
| body_str = ", ".join(f"{k}: {v}" for k, v in sorted(extra_body.items())) | ||
| gym_parts.append(f"+responses_create_params.extra_body={{{body_str}}}") | ||
|
|
||
| return " ".join(passthrough_parts + gym_parts).strip() | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and read the relevant section
cat -n nemo_skills/pipeline/utils/gym/translator.py | sed -n '130,205p'Repository: NVIDIA-NeMo/Skills
Length of output: 3631
🏁 Script executed:
# Also check the function signature to understand what this is part of
cat -n nemo_skills/pipeline/utils/gym/translator.py | sed -n '120,205p'Repository: NVIDIA-NeMo/Skills
Length of output: 4138
🏁 Script executed:
# Search for usages of this function's return value to understand how output is consumed
rg "translate_skills_args|_translate_overrides" nemo_skills/ -A 5 -B 2Repository: NVIDIA-NeMo/Skills
Length of output: 44
🏁 Script executed:
# Check the function signature and docstring to understand return value requirements
cat -n nemo_skills/pipeline/utils/gym/translator.py | sed -n '100,130p'Repository: NVIDIA-NeMo/Skills
Length of output: 1454
🏁 Script executed:
# Check what Python version this project targets
find . -name "setup.py" -o -name "pyproject.toml" -o -name ".python-version" | head -5 | xargs cat 2>/dev/null | grep -A 5 -B 5 "python"Repository: NVIDIA-NeMo/Skills
Length of output: 1627
🏁 Script executed:
# Search for actual usage of the translate_skills_args function in shell contexts
rg "translate_skills_args\(" nemo_skills/ -B 3 -A 3Repository: NVIDIA-NeMo/Skills
Length of output: 44
🏁 Script executed:
# Search for actual usage and how the return value is consumed
rg "translate_skills_overrides_to_gym" --type py -B 2 -A 5Repository: NVIDIA-NeMo/Skills
Length of output: 13733
🏁 Script executed:
# Check if shlex.join is used elsewhere in the codebase
rg "shlex\.join" nemo_skills/Repository: NVIDIA-NeMo/Skills
Length of output: 395
🏁 Script executed:
# Look for subprocess calls or shell command construction that might use this function's output
rg "ng_collect_rollouts" --type py -B 5 -A 5Repository: NVIDIA-NeMo/Skills
Length of output: 21763
🏁 Script executed:
# Verify the specific issue: check how gym_eval.py constructs the full command
cat -n nemo_skills/pipeline/utils/scripts/gym_eval.py | sed -n '245,290p'Repository: NVIDIA-NeMo/Skills
Length of output: 2275
🏁 Script executed:
# Look at a concrete scenario: what happens with values containing spaces in extra_body
python3 <<'EOF'
import shlex
# Simulate the scenario
key = "foo"
value = "hello world" # Value with space
# Current approach (what the code does)
gym_parts = []
extra_body = {}
extra_body["foo"] = "hello world"
body_str = ", ".join(f"{k}: {v}" for k, v in sorted(extra_body.items()))
gym_parts.append(f"+responses_create_params.extra_body={{{body_str}}}")
result_current = " ".join(gym_parts).strip()
print(f"Current result: {result_current}")
print(f"If re-parsed by shell, becomes tokens: {shlex.split(result_current)}")
# With shlex.join
result_proper = shlex.join(gym_parts).strip()
print(f"\nWith shlex.join: {result_proper}")
print(f"If re-parsed by shell, becomes tokens: {shlex.split(result_proper)}")
EOFRepository: NVIDIA-NeMo/Skills
Length of output: 410
Preserve shell token boundaries when rebuilding translated args.
Line 196 uses plain string join after shlex.split(), which fails to quote tokens containing spaces. When values are merged into the extra_body dict (e.g., "hello world"), the resulting string gets split into multiple tokens when parsed as shell arguments rather than a single argument. The docstring promises a "shell-quoted string" but the implementation doesn't deliver it.
Use shlex.join() instead, which is available in Python 3.8+ (project requires >=3.10) and already used elsewhere in the codebase:
- return " ".join(passthrough_parts + gym_parts).strip()
+ return shlex.join(passthrough_parts + gym_parts).strip()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/pipeline/utils/gym/translator.py` around lines 139 - 197, The
returned command string is built with " ".join(passthrough_parts + gym_parts)
which loses shell quoting and splits tokens containing spaces; replace that
final join with shlex.join(passthrough_parts + gym_parts) (ensure shlex is
imported/available) so tokens like values merged into extra_body remain properly
shell-quoted; update the return expression in the function that builds
passthrough_parts/gym_parts/extra_body (the block that currently ends with
return " ".join(...).strip()) to use shlex.join(...).strip() instead.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/test_eval.py (1)
349-598: 💤 Low valueConsider extracting shared setup into a fixture.
The three container selection tests share substantial setup code (cluster_config, benchmark_args, unit creation, monkeypatches). A parameterized fixture or helper function could reduce duplication and make the test intent clearer. That said, the current implementation is readable and self-contained.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_eval.py` around lines 349 - 598, Extract the repeated setup in test_gym_backend_picks_gym_container_falls_back_to_nemo_rl, test_gym_backend_prefers_nemo_gym_container_when_present, and test_skills_backend_keeps_using_nemo_skills_container into a shared pytest fixture or helper function: move creation of cluster_config, benchmark_args, unit, fake_command_init, and the repeated monkeypatch.setattr calls for eval_pipeline.pipeline_utils (get_cluster_config, resolve_mount_paths, get_env_variables, check_mounts, get_exp, run_exp), prepare_eval_commands, and Pipeline.run into that fixture/helper and have each test call it (or parametrize container mappings and backend); keep assertions (captured["client_container"]) and calls to eval_pipeline.eval in the individual tests and reference the same symbols (cluster_config, benchmark_args, unit, fake_command_init, prepare_eval_commands, eval_pipeline.eval) so behavior remains identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/test_eval.py`:
- Around line 349-598: Extract the repeated setup in
test_gym_backend_picks_gym_container_falls_back_to_nemo_rl,
test_gym_backend_prefers_nemo_gym_container_when_present, and
test_skills_backend_keeps_using_nemo_skills_container into a shared pytest
fixture or helper function: move creation of cluster_config, benchmark_args,
unit, fake_command_init, and the repeated monkeypatch.setattr calls for
eval_pipeline.pipeline_utils (get_cluster_config, resolve_mount_paths,
get_env_variables, check_mounts, get_exp, run_exp), prepare_eval_commands, and
Pipeline.run into that fixture/helper and have each test call it (or parametrize
container mappings and backend); keep assertions (captured["client_container"])
and calls to eval_pipeline.eval in the individual tests and reference the same
symbols (cluster_config, benchmark_args, unit, fake_command_init,
prepare_eval_commands, eval_pipeline.eval) so behavior remains identical.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bba08dfd-6a14-48fb-bd19-a13a4aabf801
📒 Files selected for processing (2)
nemo_skills/pipeline/eval.pytests/test_eval.py
Skills and Gym ship per-benchmark input JSONLs with different schemas — for gsm8k, Skills uses `problem` and Gym uses `question`, so feeding Skills' test.jsonl into ng_collect_rollouts via the resource server's `apply_prompt_to_row` produces empty prompts. The unit tests passed because they never exercised the schema dependency. - Add `input_jsonl_fpath` to GymBenchmarkConfig (required). - New `gym_input_jsonl_fpath` field on GymEvalClientScript: when set, the ng_collect_rollouts input file is the resolved `$GYM_PATH/<rel-path>` at runtime instead of the unit's Skills-side input. - Dispatcher in ns eval threads the registry value through. Tests: override forces the Gym path; absence falls back to the unit's input_file (kept for back-compat / non-benchmark uses). Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Skills' `++prompt_config=generic/math` is a path under nemo_skills/prompt/config/. The translator was passing it through as an unknown override; Gym's `ng_collect_rollouts` accepts a `prompt_config` field but interprets it as a literal file path under the Gym install dir, so the smoke run crashed with `FileNotFoundError: /opt/Gym/generic/math`. In Gym the prompt config lives in the agent yaml's `datasets[]` entry — nothing should override it via CLI. Add prompt_config and prompt_template (used by some judge benchmarks) to the silent-drop list. Tests: parametrised drop coverage, plus the gsm8k realistic-combo regression test now asserts no `prompt_config` leaks through. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
nemo_skills/pipeline/utils/gym/translator.py (1)
145-202:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse
shlex.jointo preserve argument boundaries.Line 202 rebuilds argv with plain
" ".join(...), which drops shell quoting and can corrupt overrides containing spaces.Proposed fix
- return " ".join(passthrough_parts + gym_parts).strip() + return shlex.join(passthrough_parts + gym_parts).strip()#!/bin/bash # Verify current implementation and reproduce token-boundary breakage. set -euo pipefail echo "1) Confirm return expression:" nl -ba nemo_skills/pipeline/utils/gym/translator.py | sed -n '190,206p' echo echo "2) Demonstrate why plain join is lossy:" python - <<'PY' import shlex parts = ["+responses_create_params.extra_body={note: hello world}", "--flag", "x"] plain = " ".join(parts) safe = shlex.join(parts) print("plain :", plain) print("split :", shlex.split(plain)) print("safe :", safe) print("split :", shlex.split(safe)) assert shlex.split(safe) == parts PY🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/utils/gym/translator.py` around lines 145 - 202, The return value is rebuilt with " ".join(passthrough_parts + gym_parts) which loses shell quoting and can break tokens with spaces; change that final join to use shlex.join(passthrough_parts + gym_parts) so argument boundaries are preserved (and add/import shlex at top if not already imported); keep the same list order (passthrough_parts then gym_parts) and ensure the existing variables passthrough_parts and gym_parts are used unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@nemo_skills/pipeline/utils/gym/translator.py`:
- Around line 145-202: The return value is rebuilt with "
".join(passthrough_parts + gym_parts) which loses shell quoting and can break
tokens with spaces; change that final join to use shlex.join(passthrough_parts +
gym_parts) so argument boundaries are preserved (and add/import shlex at top if
not already imported); keep the same list order (passthrough_parts then
gym_parts) and ensure the existing variables passthrough_parts and gym_parts are
used unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2b447759-28d9-480c-936f-eb997cf6986d
📒 Files selected for processing (2)
nemo_skills/pipeline/utils/gym/translator.pytests/test_gym_translator.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_gym_translator.py
The Skills-derived input JSONL carries raw fields (question/expected_answer) rather than pre-built responses_create_params, so ng_collect_rollouts has to render the prompt itself. The Gym agent yaml lists the prompt path inside `datasets[].prompt_config`, but the CLI doesn't read it from there — it has to be passed as +prompt_config explicitly. - Add `prompt_config` field (required) to GymBenchmarkConfig; populate it for gsm8k from the agent yaml's value. - New optional `gym_prompt_config` on GymEvalClientScript: when set, injects `+prompt_config="$GYM_PATH"/<rel-path>` so the path resolves against the same Gym dir as everything else at runtime. - Dispatcher in ns eval threads the registry value through. Without this fix the rollout crashes with `KeyError: 'responses_create_params'` during input preprocessing (rollout_collection.py:216). Tests: injection when set; absent when not. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 570bec36-f234-4610-beef-debb514494bd
📒 Files selected for processing (4)
nemo_skills/pipeline/eval.pynemo_skills/pipeline/utils/gym/registry.pynemo_skills/pipeline/utils/scripts/gym_eval.pytests/test_gym_eval_script.py
🚧 Files skipped from review as they are similar to previous changes (3)
- nemo_skills/pipeline/utils/gym/registry.py
- tests/test_gym_eval_script.py
- nemo_skills/pipeline/utils/scripts/gym_eval.py
| if backend == EvalBackend.gym.value: | ||
| # Gym runs one shared `ng_run` mesh per SLURM job, so all units | ||
| # in a job must share the same Gym wiring. Enforce one benchmark | ||
| # per job rather than per-unit mesh-switching. | ||
| if len(job_benchmarks) > 1: | ||
| raise ValueError( | ||
| f"--backend=gym requires one benchmark per SLURM job; this job " | ||
| f"would mix {sorted(job_benchmarks)}. Re-run with --num_jobs=-1 " | ||
| f"(or large enough that each benchmark gets its own job)." | ||
| ) | ||
| (gym_benchmark,) = job_benchmarks | ||
| gym_cfg: GymBenchmarkConfig = get_gym_config(gym_benchmark) | ||
| # Effective metric_type for the gym→skills converter: prefer | ||
| # the CLI override, else the benchmark's METRICS_TYPE module | ||
| # constant. Same precedence summarize_results uses. | ||
| gym_metric_type = metric_type or benchmarks_dict[gym_benchmark].metrics_type | ||
| client_script = GymEvalClientScript( | ||
| units=unit_dicts, | ||
| config_paths=list(gym_cfg.config_paths), | ||
| agent_name=gym_cfg.agent_name, | ||
| gym_input_jsonl_fpath=gym_cfg.input_jsonl_fpath, | ||
| gym_prompt_config=gym_cfg.prompt_config, | ||
| metric_type=gym_metric_type, | ||
| single_node_mode=single_node_mode, | ||
| with_sandbox=sandbox_enabled, | ||
| servers=server_scripts, | ||
| server_addresses_prehosted=server_addresses_list, | ||
| model_names=models_list, | ||
| server_types=server_types_list, | ||
| sandbox=sandbox_script, | ||
| installation_command=installation_command, | ||
| ) |
There was a problem hiding this comment.
Reject unsupported Gym modes here.
Lines 619-650 only enforce “one benchmark per job”. --backend=gym still accepts multi-model inputs and chunked eval units, even though this path is scoped to a single model without chunking. Please fail fast before constructing GymEvalClientScript; otherwise those user flags are accepted and the failure moves downstream.
Suggested guard
if backend == EvalBackend.gym.value:
+ if num_models != 1:
+ raise NotImplementedError(
+ "--backend=gym currently supports exactly one model per evaluation job."
+ )
+ if any(unit["chunk_id"] is not None or (unit["num_chunks"] or 1) > 1 for unit in unit_dicts):
+ raise NotImplementedError(
+ "--backend=gym does not support --num_chunks/--chunk_ids yet."
+ )
# Gym runs one shared `ng_run` mesh per SLURM job, so all units
# in a job must share the same Gym wiring. Enforce one benchmark
# per job rather than per-unit mesh-switching.As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
ng_collect_rollouts writes a sidecar `<stem>_materialized_inputs.jsonl`
next to its output. The parent directory has to exist before that, but
neither nemo_run nor the GymEvalClientScript was creating it, so the
smoke run hit:
FileNotFoundError: [Errno 2] No such file or directory:
'<output_dir>/rollouts_materialized_inputs.jsonl'
Mirrors NemoGymRolloutsScript which does the same mkdir.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
The post-rollout converter runs inside the Gym container, which only has
`nemo-skills-core` installed (no `nemo_run`). The adapter previously lived
under `nemo_skills.pipeline.adapters`; importing it triggered the pipeline
package's `nemo_run` guard:
ImportError: nemo-skills-core is installed but pipeline functionality
requires the full package.
Move to `nemo_skills.adapters` (top-level). The adapter has no pipeline
dependencies, so this is purely about being reachable from the core install.
Updates the `python -m` invocation in GymEvalClientScript and all test
imports / shell-emit assertions.
Surfaced by the smoke run: the rollout itself completed at 90% symbolic
accuracy on 10 problems; only the post-step conversion failed to start.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Hydra accepts `+responses_create_params.extra_body={seed: 0}` as a single
override, but the space inside the dict makes bash word-split it into two
args. Hydra then sees `+…extra_body={seed:` and `0}` separately and dies
with `no viable alternative at input '{seed:'`. Surfaced live on a 2-seed
parity run (seed=0/1 → extra_body emitted; the smoke didn't trip it
because no extra_body was generated when random_seed=None).
Wrap both call sites in double quotes:
- Translator (when min_p/top_k/repetition_penalty/random_seed produce
extra_body, plus any explicit `++inference.extra_body.foo=bar`).
- GymEvalClientScript (when per-unit random_seed surfaces as
`extra_body={seed: N}`).
Tests: existing extra_body assertions updated to expect the quoted form;
new regression test for the per-seed shell-quoting case.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…forbid'
I'd been routing Skills' vLLM-only knobs (top_k, min_p, repetition_penalty,
random_seed) and the per-job random_seed through
`+responses_create_params.extra_body={…}`. That's wrong: Gym's
`NeMoGymResponseCreateParamsNonStreaming` is `extra="forbid"` and has no
`extra_body` field on the per-call schema. The 2-seed parity run on the
full benchmark surfaced this — every rollout 422'd with:
"type": "extra_forbidden",
"loc": ["body", "responses_create_params", "extra_body"]
(The smoke didn't trip it because no extra_body was emitted when
random_seed=None and the recipe didn't include any vLLM-only knobs.)
The `extra_body` Hydra-override memory was about the *model-server* config
field (`vllm_model.extra_body`) which is static / per-server, not per-call.
The right home for these knobs is overriding the model-server config at
ng_run time, not threading per-call args through ng_collect_rollouts.
Changes:
- Translator: vLLM-only knobs become silent-with-warning drops. Document
the proper home in the warning text. Same for ++inference.extra_body.*.
- GymEvalClientScript: stop injecting `extra_body={seed: N}` per unit. Per-seed
reproducibility is a known gap — Skills' two seeds and Gym's two seeds
will produce uncorrelated variance, but mean@N statistics still compare.
Tests updated accordingly + regression assertions that `extra_body`/`seed:`
never leak into the emitted shell command.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Scope correction: the migration intentionally breaks the Skills
`output.jsonl` / `metrics.json` contract. Gym's native `rollouts.jsonl`
+ `rollouts_aggregate_metrics.json` are the canonical outputs on the
`--backend=gym` path.
Removes:
- `nemo_skills/adapters/gym_to_skills.py` and its package (whole adapter).
- `tests/test_gym_to_skills_adapter.py` (18 tests).
- `metric_type` field, `_convert_cmd` method, and the
`_skills_output_jsonl_for_unit` helper on `GymEvalClientScript`. The
per-unit shell block is now just ng_collect_rollouts (no post-step).
- The dispatcher's `gym_metric_type` plumbing.
Changes:
- `ns eval` skips the `summarize_results` SLURM task on the gym path —
Gym already writes its aggregate metrics next to the rollouts JSONL,
and `summarize_results` expects a Skills schema we no longer produce.
Tests:
- Replace the converter-wiring tests with a single negative-regression
("no skills schema artifacts in the emitted shell") on
GymEvalClientScript.
- New `test_eval.py::test_gym_backend_skips_summarize_results_task`
asserts the dispatcher doesn't schedule the dependent summarize job
for `--backend=gym`.
The 2.46pp pass@1 gap reported in the bring-up parity run was an artefact
of Skills' MathMetrics processing Gym's lenient `extracted_answer`. With
this change we no longer feed Gym data through Skills' MathMetrics, so
that gap doesn't apply.
83 tests pass.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
All four wrap `math_with_judge` with a per-benchmark agent of the same shape as gsm8k. Factor `_math_with_judge_entry()` so each registration is one line plus an optional prompt_config override (aime/hmmt ship their own per-benchmark prompt YAML; gsm8k and hendrycks_math share the shared `benchmarks/prompts/generic_math.yaml`). Tests: parametrized check that each entry resolves to the right config_paths / agent_name / input_jsonl_fpath / prompt_config — guards against typos in the path layout. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Upstream Gym restructured `benchmarks/prompts/`: the flat `generic_math.yaml`, `math_answer_judge.yaml`, etc. moved to sub-directories (`generic/math.yaml`, `judge/math.yaml`, …). The per-benchmark prompt YAMLs we'd been pointing aime24/aime25/hmmt_feb25 at (`benchmarks/<bench>/prompts/default.yaml`) were also removed — every math_with_judge benchmark now uses the shared `benchmarks/prompts/generic/math.yaml`. Update all five registry entries to the new path, simplify the helper's default to match, and tighten the parametrized registry test to assert the shared prompt config. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Phase 3 starts diversifying metric types beyond math. Neither benchmark fits the `math_with_judge` helper (agent names use `_mcqa_` / `_benchmark_` infix, not `_math_with_judge_`), so they're registered as standalone GymBenchmarkConfig entries. - gpqa → resources_servers/mcqa with `lenient_answer_colon_md` grading; data file is `gpqa_diamond_benchmark.jsonl` (Gym ships the diamond split as the default). - ifbench → resources_servers/ifbench; agent name has `_benchmark_` infix per the Gym yaml convention. livecodebench deferred — has v5/v6 split, needs sandbox, and uses the code_gen resource server. Separate cycle. Tests: per-benchmark registry checks asserting agent_name / input file / prompt path. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
…ompt path - livecodebench: Skills' `EVAL_SPLIT="test_v6_2408_2505"` maps to Gym's `benchmarks/livecodebench/v6_2408_2505/` config. Code execution is handled inside Gym's code_gen resource server (Ray subprocess pool), so this entry doesn't set Skills' `--with-sandbox`. - gpqa: upstream Gym moved `benchmarks/<bench>/prompts/default.yaml` to shared `benchmarks/prompts/eval/aai/mcq-4choices.yaml`. Update the registry entry + test to point at the current path. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Replace the hand-rolled 8-entry dict with the full set of 57 benchmarks whose Gym `benchmarks/<name>/config.yaml` maps to a Skills dataset module. Entries are generated by `tools/discover_registry.py` in the recipe root. Regenerate after any upstream Gym pull. Most are registered but unvalidated — adding the entry just unblocks `--backend=gym` to try them. Each will surface its own quirks (metric naming, container deps, schema mismatches) when actually run. Known already-validated set: gsm8k, aime24/25, hmmt_feb25, hendrycks_math (math), ifbench (IF), gpqa (mcqa). livecodebench has the registry entry but live-validation is still pending. Also relaxed `GymBenchmarkConfig.prompt_config` to `Optional[str]` since some Gym benchmarks (aalcr, mrcr) ship without one — input rows already carry `responses_create_params.input`. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Summary
Scaffolds a Gym generation backend for
ns eval, gated on an opt-in--backend={skills,gym}flag (default:skills, no behavior change). Thegympath runsng_run+ng_collect_rolloutsagainst NeMo-Gym, converts the resulting rollouts into a Skills-shapeoutput.jsonl, and lets the existingsummarize_resultsstep producemetrics.jsonunchanged.Builds on PR #1222 (
EvalGenerationUnit/EvalClientScript). ReusesServerScript/SandboxScript/ hostname-ref resolution andNemoGymRolloutsScript's install-and-wait shell scaffolding.Pilot scope is gsm8k only — other benchmarks fall through to the existing Skills path. A tiny per-benchmark registry (
nemo_skills/pipeline/utils/gym/registry.py) maps a benchmark name to(config_paths, agent_name); once Gym's auto-resolver PR lands, this is the obvious replacement point.What's in here
Three reviewable commits:
feat(eval): add --backend={skills,gym} flag and Skills→Gym override translatorEvalBackendenum +--backendtyper flag onns eval.nemo_skills/pipeline/utils/gym/translator.py: pure function that rewrites Skills++Hydra overrides into Gym+overrides forng_collect_rollouts. Handles direct renames,extra_bodymerging (with dict-literal syntax to dodge theextra_body: nullHydra-override gotcha), and an explicit drop-list for fields the migration agreed not to carry over.feat(eval): wire --backend=gym to a GymEvalClientScript pilot (gsm8k only)nemo_skills/pipeline/utils/scripts/gym_eval.py:GymEvalClientScript, peer ofEvalClientScript. Builds a single shell command: resolve Gym path → activate venv → wait for policy server → startng_run→ pollng_status→ run oneng_collect_rolloutsper unit → clean up.nemo_skills/pipeline/utils/gym/registry.py: temporary benchmark registry (gsm8k only).pipeline/eval.py: registered-benchmark pre-check, one-benchmark-per-job guard, multi-model not supported.feat(eval): gym→skills metrics adapter (math case)nemo_skills/pipeline/adapters/gym_to_skills.py: pure converter + file-level wrapper + argparse CLI. Mapsmath_with_judgerollouts to{generation, predicted_answer, expected_answer, symbolic_correct}(plus an opt-injudgementplaceholder for later judge benchmarks).summarize_resultsproducesmetrics.jsonwith no changes.Test plan
extra_bodymerging, silent-drops, server/sandbox drops, rejected overrides, unknown-passthrough vs strict mode, realistic gsm8k combo).GymEvalClientScript: 15 unit tests (registry, validation, shell-emit shape, per-seed / per-chunk output paths, translator pipeline, server hostname ref, sandbox env, multi-unit single-ng_runsharing, converter wiring).library_rewardabsent, multi-chunk output-text concat, judge surfacing, LaTeX pass-through, file IO, blank-line handling, dst-dir creation, CLI exit codes).ns evalsentinel tests: backend flag wired; gym path rejects unregistered benchmarks; mixed registered+unregistered rejected.test_eval.py.Things deferred
ns generatemigration: separate follow-up; same dispatcher pattern.NotImplementedError; Skills backend covers both.summarize_resultsconsumers working unchanged.Risk
--backend=skillsis the default and is untouched. The new code paths are gated behind--backend=gymand refuse to start for any benchmark not in the registry.Summary by CodeRabbit
New Features
Tests