Skip to content

Gym CLI refactor#1630

Merged
marta-sd merged 45 commits into
mainfrom
martas/1434
Jun 25, 2026
Merged

Gym CLI refactor#1630
marta-sd merged 45 commits into
mainfrom
martas/1434

Conversation

@marta-sd

@marta-sd marta-sd commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Implements #1434

Notes:

  • Hydra dash-flags no longer supported. The gym CLI parses its own flags, so
    Hydra's native --multirun, --config-dir, etc. are no longer supported. W assume blast radius is small (we don't have them in docs, most users already use overrides).
  • Hydra diagnostics now go to stderr. Hydra's log handler is redirected inside
    get_global_config_dict(), so its diagnostic lines move from stdout to stderr (this also
    affects legacy ng_* commands). This was needed to keep the stdout clean for --json/piping.

marta-sd added 11 commits June 12, 2026 13:40
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
@marta-sd marta-sd self-assigned this Jun 17, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

marta-sd added 9 commits June 17, 2026 11:04
…nt deprecation note

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
… to stderr)

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
…a single --model flag

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
@marta-sd marta-sd marked this pull request as ready for review June 22, 2026 11:30
@marta-sd marta-sd requested a review from a team as a code owner June 22, 2026 11:30
@marta-sd marta-sd requested review from ananthsub and wprazuch June 22, 2026 11:31
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
@marta-sd

Copy link
Copy Markdown
Contributor Author

/claude review

Comment thread nemo_gym/cli/legacy.py Outdated
Comment thread nemo_gym/global_config.py
@marta-sd

marta-sd commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

can you test with a test.pypi wheel to ensure pypi doesnt break either?

@cmunley1 I don't know how to do it for Gym. Is there a CI workflow I should trigger?

Update: I see there was one, but it was removed I think? I downloaded the wheel from artifacts (job) and verified it installs and basic commands (gym --help, gym --version, gym list benchmarks) work in a fresh env.

Comment thread nemo_gym/cli/main.py
Comment thread nemo_gym/cli/main.py Outdated
Comment thread nemo_gym/cli/main.py Outdated
Comment thread nemo_gym/cli/main.py
Comment thread nemo_gym/cli/eval.py Outdated
Comment thread nemo_gym/cli/legacy.py
@anwithk anwithk linked an issue Jun 25, 2026 that may be closed by this pull request
12 tasks
marta-sd and others added 8 commits June 25, 2026 08:59
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
@marta-sd

Copy link
Copy Markdown
Contributor Author

/ok to test 5067752

@marta-sd marta-sd dismissed cmunley1’s stale review June 25, 2026 10:41

I hopefully addressed all feedback (no blockers has been raised during the daily)

@marta-sd marta-sd merged commit 4a3fe3f into main Jun 25, 2026
31 checks passed
@marta-sd marta-sd deleted the martas/1434 branch June 25, 2026 10:50
wprazuch added a commit that referenced this pull request Jun 25, 2026
…tion #12) (#1599)

## What

Adds **`gym env validate`** (+ `ng_validate` / `nemo_gym_validate`
deprecated shims) — runs the full config parse with **no Ray and no
server subprocesses**, then exits **0 (valid) / 1 (invalid)** with a
clean, rich-escaped message (**no traceback**). Returns in well under a
second instead of after a ~30–60s Ray bootstrap.

```bash
gym env validate --config resources_servers/<env>/configs/<env>.yaml --config responses_api_models/<model>/configs/<model>.yaml
gym env validate --benchmark gsm8k --model-type openai_model
```

## How

`validate()` lives in `cli/env.py` and is registered as `env validate`
in the `gym` router (`cli/main.py` COMMANDS) with the same
config-selection flags as `env start` (`--config`, `--benchmark`,
`--environment`, `--resources-server`, `--model-type`, `--search-dir`,
`--model*`). It reuses the same `get_global_config_dict()` parse path
the other commands use, so the validation checks stay in sync:

- **config_paths** resolution — missing/typo'd
([#1488](#1488)) and malformed
([#1490](#1490))
- **server cross-references** — unknown `name:` refs
([#1561](#1561))
- **mandatory `???`** values
([#1575](#1575))
- **schema** (`BaseNeMoGymCLIConfig`)

Wrapped in `exit_cleanly_on_config_error` (from #1609) so any
`ConfigError` becomes a clean message + `exit 1`. A dummy `policy_model`
is injected (the `NO_MODEL` parser config, as in `gym list` / `env
compose`) so model interpolations like `${policy_base_url}` resolve
without real creds — validation is about config **well-formedness**; the
real model is supplied by the `--model*` flags at run time.

## Targets `main`

Originally drafted on the unified-CLI epic branch; rebuilt directly on
`main` now that [#1630](#1630)
(and #1637/#1609/#1635/#1671) have merged. The old branch contents (a
snapshot of the CLI refactor + unrelated CI commits) were superseded and
replaced.

## Scope note

The zero-server check
([#1489](#1489), "nothing
configured to run") is intentionally **not** part of `validate`:
`NO_MODEL` injects a dummy model server (which would defeat the check),
and "is anything configured to run" is a *start*-time concern already
enforced by `gym env start` before Ray init. `validate` focuses on
config well-formedness.

## Why

Epic [#1205](#1205) friction
#12 (no config validation tooling) — the M1 "fast failure triage"
deliverable. Config errors otherwise only surface after Ray starts
(~30–60s).

## Tests

- `test_cli_main.py`: `gym env validate --config X` routes to
`nemo_gym.cli.env:validate` with `+config_paths=[X]` (added to the
parametrized config-command matrix).
- `test_cli.py`: `validate()` prints OK on a valid config; a raised
`ConfigError` becomes `exit 1` (no traceback).
- All `test_cli` + `test_cli_main` + `test_cli_legacy` pass (the only
failures are the pre-existing Python-3.12 `TestDidYouMean` argparse
issue on `main`); ruff + pre-commit clean. Smoke-tested end-to-end: `✓
Config is valid.` on a real benchmark, clean error + `exit 1` on a bad
path, and the `ng_validate` deprecation shim.

---------

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
ritaneves pushed a commit that referenced this pull request Jun 25, 2026
:wq

Gym CLI refactor (#1630)

Implements #1434

Notes:

- **Hydra dash-flags no longer supported.** The `gym` CLI parses its own
flags, so
Hydra's native `--multirun`, `--config-dir`, etc. are no longer
supported. W assume blast radius is small (we don't have them in docs,
most users already use overrides).
- **Hydra diagnostics now go to stderr.** Hydra's log handler is
redirected inside
`get_global_config_dict()`, so its diagnostic lines move from stdout to
stderr (this also
affects legacy `ng_*` commands). This was needed to keep the stdout
clean for `--json`/piping.

---------

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Rita Fernandes Neves <rfernandesne@nvidia.com>
gwarmstrong added a commit to gwarmstrong/Gym that referenced this pull request Jun 25, 2026
Resolve conflict in fern/.../reference/cli-commands.mdx: upstream's CLI
refactor (NVIDIA-NeMo#1630) replaced ng_collect_rollouts docs with the 'gym eval run'
command. Re-applied the per-agent num_repeats (dict-form) documentation onto
the new --num-repeats flag (table entry + example). rollout_collection.py and
test_rollout_collection.py auto-merged cleanly; all 38 unit tests pass.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Epic: Gym CLI usability -- first class CLI experience

5 participants