Gym CLI refactor by marta-sd · Pull Request #1630 · NVIDIA-NeMo/Gym

marta-sd · 2026-06-17T09:03:05Z

Implements #1434

Notes:

- **Hydra dash-flags no longer supported.** The `gym` CLI parses its own flags, so Hydra's native `--multirun`, `--config-dir`, etc. are no longer supported. We assume the blast radius is small: these flags are not in the docs, and most users already use overrides.
- **Hydra diagnostics now go to stderr.** Hydra's log handler is redirected inside `get_global_config_dict()`, so its diagnostic lines move from stdout to stderr (this also affects the legacy `ng_*` commands). This was needed to keep stdout clean for `--json` output and piping.
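The stderr note above can be illustrated with a minimal sketch using only the standard `logging` module. It is not the code in `get_global_config_dict()`; the helper names `redirect_stream_handlers_to_stderr` and `render_json` and the logger name are hypothetical, chosen only to show why moving diagnostics off stdout keeps `--json` output parseable.

```python
import json
import logging
import sys


def redirect_stream_handlers_to_stderr(logger: logging.Logger) -> int:
    """Point every handler that currently writes to stdout at stderr instead.

    Returns the number of handlers that were redirected.
    """
    redirected = 0
    for handler in logger.handlers:
        # Only touch stream handlers whose target is stdout; file handlers
        # and handlers already on stderr are left alone.
        if isinstance(handler, logging.StreamHandler) and handler.stream is sys.stdout:
            handler.setStream(sys.stderr)
            redirected += 1
    return redirected


def render_json(payload: dict) -> str:
    """Serialize the machine-readable result that is meant for stdout."""
    return json.dumps(payload)


if __name__ == "__main__":
    log = logging.getLogger("hypothetical.diagnostics")
    log.setLevel(logging.INFO)
    log.addHandler(logging.StreamHandler(sys.stdout))  # writes to stdout on purpose
    redirect_stream_handlers_to_stderr(log)
    log.info("loading config")            # diagnostic line, now on stderr
    print(render_json({"status": "ok"}))  # the only line on stdout
```

With diagnostics on stderr, a consumer can pipe stdout straight into a JSON parser without filtering log lines first.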

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

copy-pr-bot · 2026-06-17T09:03:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

marta-sd · 2026-06-22T12:07:03Z

/claude review

marta-sd · 2026-06-24T13:20:46Z

> can you test with a test.pypi wheel to ensure pypi doesnt break either?

@cmunley1 I don't know how to do it for Gym. Is there a CI workflow I should trigger?

Update: I see there was one, but it was removed I think? I downloaded the wheel from artifacts (job) and verified it installs and basic commands (gym --help, gym --version, gym list benchmarks) work in a fresh env.

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

marta-sd · 2026-06-25T09:47:10Z

/ok to test 5067752

I hopefully addressed all feedback (no blockers were raised during the daily).

…tion #12) (#1599)

## What

Adds **`gym env validate`** (+ `ng_validate` / `nemo_gym_validate` deprecated shims), which runs the full config parse with **no Ray and no server subprocesses**, then exits **0 (valid) / 1 (invalid)** with a clean, rich-escaped message (**no traceback**). Returns in well under a second instead of after a ~30–60s Ray bootstrap.

```bash
gym env validate --config resources_servers/<env>/configs/<env>.yaml --config responses_api_models/<model>/configs/<model>.yaml
gym env validate --benchmark gsm8k --model-type openai_model
```

## How

`validate()` lives in `cli/env.py` and is registered as `env validate` in the `gym` router (`cli/main.py` COMMANDS) with the same config-selection flags as `env start` (`--config`, `--benchmark`, `--environment`, `--resources-server`, `--model-type`, `--search-dir`, `--model*`). It reuses the same `get_global_config_dict()` parse path the other commands use, so the validation checks stay in sync:

- **config_paths** resolution: missing/typo'd (#1488) and malformed (#1490)
- **server cross-references**: unknown `name:` refs (#1561)
- **mandatory `???`** values (#1575)
- **schema** (`BaseNeMoGymCLIConfig`)

Wrapped in `exit_cleanly_on_config_error` (from #1609) so any `ConfigError` becomes a clean message + `exit 1`. A dummy `policy_model` is injected (the `NO_MODEL` parser config, as in `gym list` / `env compose`) so model interpolations like `${policy_base_url}` resolve without real creds. Validation is about config **well-formedness**; the real model is supplied by the `--model*` flags at run time.

## Targets `main`

Originally drafted on the unified-CLI epic branch; rebuilt directly on `main` now that #1630 (and #1637/#1609/#1635/#1671) have merged. The old branch contents (a snapshot of the CLI refactor + unrelated CI commits) were superseded and replaced.

## Scope note

The zero-server check (#1489, "nothing configured to run") is intentionally **not** part of `validate`: `NO_MODEL` injects a dummy model server (which would defeat the check), and "is anything configured to run" is a *start*-time concern already enforced by `gym env start` before Ray init. `validate` focuses on config well-formedness.

## Why

Epic #1205 friction #12 (no config validation tooling), the M1 "fast failure triage" deliverable. Config errors otherwise only surface after Ray starts (~30–60s).

## Tests

- `test_cli_main.py`: `gym env validate --config X` routes to `nemo_gym.cli.env:validate` with `+config_paths=[X]` (added to the parametrized config-command matrix).
- `test_cli.py`: `validate()` prints OK on a valid config; a raised `ConfigError` becomes `exit 1` (no traceback).
- All `test_cli` + `test_cli_main` + `test_cli_legacy` pass (the only failures are the pre-existing Python-3.12 `TestDidYouMean` argparse issue on `main`); ruff + pre-commit clean.

Smoke-tested end-to-end: `✓ Config is valid.` on a real benchmark, clean error + `exit 1` on a bad path, and the `ng_validate` deprecation shim.

---------

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
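The "clean message + `exit 1`" behaviour described above can be sketched as a small decorator. This is an illustration under stated assumptions, not the project's `exit_cleanly_on_config_error`: the `ConfigError` class below is a local stand-in, `exit_cleanly` is a hypothetical name, and the toy `validate` only checks for the `???` mandatory-value marker in a flat dict.

```python
import functools
import sys


class ConfigError(Exception):
    """Local stand-in for the project's config error type (assumption)."""


def exit_cleanly(func):
    """Turn a ConfigError into a one-line stderr message plus exit code 1."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ConfigError as err:
            print(f"Config error: {err}", file=sys.stderr)
            # 'from None' suppresses the chained traceback context.
            raise SystemExit(1) from None

    return wrapper


@exit_cleanly
def validate(config: dict) -> int:
    """Toy check: reject any value still set to the '???' marker."""
    missing = sorted(key for key, value in config.items() if value == "???")
    if missing:
        raise ConfigError(f"missing mandatory values: {', '.join(missing)}")
    print("Config is valid.")
    return 0
```

Keeping the error handling in a decorator means every command that parses config can share the same exit behaviour without repeating the `try`/`except`.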

Gym CLI refactor (#1630)

Implements #1434

Notes:

- **Hydra dash-flags no longer supported.** The `gym` CLI parses its own flags, so Hydra's native `--multirun`, `--config-dir`, etc. are no longer supported. We assume the blast radius is small: these flags are not in the docs, and most users already use overrides.
- **Hydra diagnostics now go to stderr.** Hydra's log handler is redirected inside `get_global_config_dict()`, so its diagnostic lines move from stdout to stderr (this also affects the legacy `ng_*` commands). This was needed to keep stdout clean for `--json` output and piping.

---------

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Rita Fernandes Neves <rfernandesne@nvidia.com>
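The first note (the CLI parses its own flags and no longer forwards Hydra dash-flags) can be sketched with `argparse`. This is a minimal illustration, not the router in `nemo_gym/cli/main.py`: `split_args`, the `gym-sketch` program name, and the `foo.bar=1` override are hypothetical, and only the `--config` and `--json` flags are taken from the PR.

```python
import argparse


def split_args(argv):
    """Separate the CLI's own flags from Hydra-style key=value overrides."""
    parser = argparse.ArgumentParser(prog="gym-sketch", allow_abbrev=False)
    parser.add_argument("--config", action="append", default=[])
    parser.add_argument("--json", action="store_true")
    known, rest = parser.parse_known_args(argv)

    # Dash-prefixed arguments the parser did not recognize are rejected
    # instead of being forwarded, so a flag such as --multirun fails fast.
    unknown_flags = [arg for arg in rest if arg.startswith("-")]
    if unknown_flags:
        parser.error(f"unrecognized flags: {' '.join(unknown_flags)}")

    # Everything else must look like an override (key=value).
    stray = [arg for arg in rest if "=" not in arg]
    if stray:
        parser.error(f"expected key=value overrides, got: {' '.join(stray)}")

    return known, rest


if __name__ == "__main__":
    flags, overrides = split_args(["--config", "a.yaml", "foo.bar=1", "--json"])
    print(flags.config, flags.json, overrides)
```

`parser.error` prints a usage message to stderr and exits with status 2, so an unsupported flag surfaces as a normal CLI error rather than reaching the config layer.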

Resolve conflict in fern/.../reference/cli-commands.mdx: upstream's CLI refactor (NVIDIA-NeMo#1630) replaced ng_collect_rollouts docs with the 'gym eval run' command. Re-applied the per-agent num_repeats (dict-form) documentation onto the new --num-repeats flag (table entry + example). rollout_collection.py and test_rollout_collection.py auto-merged cleanly; all 38 unit tests pass. Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

marta-sd added 11 commits June 12, 2026 13:40

chore: move all cli functions into a single location

0b159e3

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add basic 'gym' command router

7cdcb8e

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add --config and --storage flags to the gym router

cba678a

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

fix: don't pass unknown --flag or -f args to hydra

82e3505

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

chore: add basic tests for testing new cli

b075b92

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: run all tests if no resource server was passed

95ddd2f

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add flags to eval run

0519f87

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add flags to dataset commands

e013525

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add flags for other eval commands

6a79259

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add flags for env commands

bed6576

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

chore: print deprecation notice for all legacy commands

4a05371

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

marta-sd self-assigned this Jun 17, 2026

marta-sd added 9 commits June 17, 2026 11:04

chore: add tests for checking that all ng_ and nemo_gym_ commands pri…

17b79ae

…nt deprecation note Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add --verbose flag

6288170

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: allow to select relevant config through passing server name

6a2cd5c

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add --json flag for machine-readable output (+ move diagnostics…

16431a6

… to stderr) Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add 'did you mean?' hints for typos

02967e0

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add --search-dir for loading user configs from custom location

81fca19

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add gym search command

ba6dc10

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add generic deployment config and --model-checkpoint flag

c53e912

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

fix: unify redundant --model-name and --model-checkpoint flags into …

4b97e8a

…a single --model flag Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

marta-sd marked this pull request as ready for review June 22, 2026 11:30

marta-sd requested a review from a team as a code owner June 22, 2026 11:30

marta-sd requested review from ananthsub and wprazuch June 22, 2026 11:31

feat: add --benchmark flag to gym env run command

c0747bd

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

claude Bot reviewed Jun 22, 2026

Comment thread nemo_gym/cli/legacy.py Outdated

claude Bot reviewed Jun 22, 2026

Comment thread nemo_gym/global_config.py

This was referenced Jun 24, 2026

feat(cli): run an environment by name — 'gym env run --env <name>' (#1205 friction #8 / FEP-1022) #1636

Closed

feat: agent registry — name-based agent discovery + composability (M3 core) #1671

Merged

ananthsub reviewed Jun 24, 2026

Comment thread nemo_gym/cli/main.py

Comment thread nemo_gym/cli/main.py Outdated

Comment thread nemo_gym/cli/main.py Outdated

Comment thread nemo_gym/cli/main.py

Comment thread nemo_gym/cli/eval.py Outdated

Comment thread nemo_gym/cli/legacy.py

anwithk linked an issue Jun 25, 2026 that may be closed by this pull request

Epic: Gym CLI usability -- first class CLI experience #1434

Closed

12 tasks

marta-sd and others added 8 commits June 25, 2026 08:59

Update nemo_gym/cli/main.py

e150edd

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>

Merge branch 'main' into martas/1434

4deeb4f

fix: add --input-glob flag to gym eval aggregate

e552e3a

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

fix: update comment

34b57f1

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: add unit tests for dispatch

7e1525f

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

chore: update docstrings to point to new commands

14a790c

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

feat: test if legacy commands point to real new commands

2775550

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>

Merge branch 'main' into martas/1434

5067752

copy-pr-bot Bot temporarily deployed to public June 25, 2026 09:47 Inactive

copy-pr-bot Bot temporarily deployed to public June 25, 2026 09:49 Inactive

wprazuch approved these changes Jun 25, 2026

ko3n1g approved these changes Jun 25, 2026

marta-sd merged commit 4a3fe3f into main Jun 25, 2026
31 checks passed

marta-sd deleted the martas/1434 branch June 25, 2026 10:50

wprazuch mentioned this pull request Jun 25, 2026

feat(cli): add 'gym env validate' pre-flight config check (#1205 friction #12) #1599

Merged

marta-sd mentioned this pull request Jun 25, 2026

docs: fill CLI reference gaps for data prep and rollout collection #1675

Merged

sephmard mentioned this pull request Jun 29, 2026

docs: add ng_materialize_prompts to CLI reference #1401

Closed

2 tasks

marta-sd mentioned this pull request Jul 2, 2026

chore: update CI workflow to use new CLI commands #1901

Open

marta-sd merged 45 commits into main from martas/1434

5 participants
