Skip to content

Use upstream NeMo-RL entrypoints and simplify sandbox integration#1381

Open
wedu-nvidia wants to merge 9 commits into
mainfrom
wedu/nemo-rl-update
Open

Use upstream NeMo-RL entrypoints and simplify sandbox integration#1381
wedu-nvidia wants to merge 9 commits into
mainfrom
wedu/nemo-rl-update

Conversation

@wedu-nvidia
Copy link
Copy Markdown
Collaborator

@wedu-nvidia wedu-nvidia commented Apr 17, 2026

Summary by CodeRabbit

  • New Features

    • Added config override and container image options for GRPO and SFT pipelines
    • Added Ray sandbox startup templates for multi-node cluster bring-up and optional sandboxed execution
  • Refactor

    • Switched from local build to a pinned upstream runtime image
    • Training pipelines now invoke upstream training entrypoints and accept configurable training configs
    • Removed legacy on-host training entrypoints and local SFT config files
    • Improved sandbox behavior for Ray-enabled heterogeneous jobs
  • Bug Fixes

    • SLURM executor now fails the job if any task exits non-zero (stricter failure handling)
  • Documentation

    • Cleared references to deprecated script paths

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Removed the local NeMo‑RL Dockerfile and local SFT/GRPO entrypoints; pinned and use an upstream NeMo‑RL container image; updated pipelines to call upstream entrypoints with new CLI options and config mounting; added Ray sandbox SLURM templates and adjusted SLURM/Ray sandbox orchestration and SFT config references.

Changes

Cohort / File(s) Summary
Dockerfile removed
dockerfiles/Dockerfile.nemo-rl
Deleted local multi-stage NeMo‑RL Dockerfile and its hermetic build/prefetch steps.
Container mapping / cluster examples
nemo_skills/__init__.py, cluster_configs/example-local.yaml, tests/gpu-tests/test-local.yaml
Switched nemo-rl executor from a local Dockerfile build to pinned remote image nvcr.io/nvidian/nemo-rl:733c7d6-48669320.
Pipeline — GRPO
nemo_skills/pipeline/nemo_rl/grpo.py
Added upstream path constants; invoke upstream GRPO script/config; added CLI options config, container, sandbox_container; added NemoRLTask.config_path; remapped data CLI keys and PYTHONPATH/UV_PROJECT to upstream root; mount absolute configs outside upstream root.
Pipeline — SFT
nemo_skills/pipeline/nemo_rl/sft.py
Added upstream path constants; switched to upstream SFT entrypoint (run_sft.py) using config_path; added config and container CLI options; remapped data CLI keys and environment wiring; unified container selection and config mounting behavior.
Pipeline — executor / sandbox logic
nemo_skills/pipeline/utils/exp.py
Added --kill-on-bad-exit to srun args; adjusted heterogeneous-group counting/order and GPU partitioning; when using Ray, inject SANDBOX_* env vars into main task and skip standalone sandbox executor so Ray template manages sandbox.
Removed local training orchestration
nemo_skills/training/nemo_rl/start_grpo.py, nemo_skills/training/nemo_rl/start_sft.py
Deleted both local training entrypoint scripts and their helper classes/functions; orchestration delegated to upstream NeMo‑RL entrypoints.
SFT config removed
nemo_skills/training/nemo_rl/configs/sft.yaml
Removed the entire SFT YAML configuration file (all training/policy/data/checkpointing/logger/cluster keys deleted).
Docs
docs/basics/cluster-configs.md, docs/pipelines/training.md
Removed references to local start_sft.py and a sentence about forwarding extra parameters.
Ray sandbox templates added
nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2, ..._enroot.sub.j2
Added two Jinja SLURM templates to provision multi-node Ray clusters (head/workers), optional sandbox containers, port/env management, readiness polling, retries, log sync, attach helpers, and ENSURE worker readiness counting.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(30,144,255,0.5)
    participant CLI as Pipeline CLI
    participant Orchestrator as exp.add_task / Scheduler
    participant SLURM as SLURM Job / srun
    participant RayTemplate as Ray Sandbox Template
    participant Container as Upstream NeMo‑RL Container
  end
  CLI->>Orchestrator: submit job (opts: config, container, sandbox, ray_template)
  Orchestrator->>SLURM: compute het groups, node counts, sandbox strategy
  alt with Ray
    Orchestrator->>SLURM: inject SANDBOX_* env into main task
    SLURM->>RayTemplate: run Ray head/workers + optional sandbox
    RayTemplate->>Container: start head/workers, run SANDBOX_COMMAND inside container
  else without Ray
    Orchestrator->>SLURM: create per-het-group executors and sandbox sidecars
    SLURM->>Container: run executors and sandbox sidecars via srun
  end
  SLURM->>Container: execute upstream entrypoint (run_sft/run_grpo) with --config and mounted paths
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • activatedgeek
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly reflects the main objective: migrating from local NeMo-RL setup to upstream entrypoints and improving sandbox integration with Ray templates.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch wedu/nemo-rl-update

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
dockerfiles/Dockerfile.nemo-rl (1)

7-7: Consider adding a non-root user for container runtime.

Trivy flagged that the final USER in this Dockerfile is root. While this may be intentional for training workloads requiring elevated permissions, running as non-root is a security best practice. If root access isn't strictly required at runtime, consider adding a non-root user.

💡 Optional fix
 RUN uv pip install /opt/NeMo-Skills
+
+# Optional: Run as non-root user at runtime
+RUN useradd -m -u 1000 nemorl
+USER nemorl
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dockerfiles/Dockerfile.nemo-rl` at line 7, The Dockerfile currently ends with
"USER root", which keeps the container running as root; replace this by creating
and switching to a non-root user (e.g., create a user/group like "appuser" or
"nemo" during the image build and end with "USER appuser") and ensure any
runtime directories or files used by the process are owned or writable by that
user (use chown/chmod during build for directories the service needs). Verify
that any entrypoint or startup scripts (if referenced) do not require root-only
operations or elevate privileges only when necessary. Ensure the new non-root
username matches the final USER instruction and adjust permissions for any
volumes or bind-mounted paths the container needs to write to.
nemo_skills/pipeline/utils/exp.py (1)

536-539: Variable client_num_gpus is reassigned later in the function.

client_num_gpus is computed at line 536 for the server_goes_first logic, but then reassigned at line 638 with different logic (num_gpus if (server_config is None or num_nodes > 1) else 0). This reassignment appears intentional (different contexts), but the same variable name could cause confusion. Consider renaming one to clarify the distinct purposes.

Also applies to: 638-638

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/exp.py` around lines 536 - 539, The variable
client_num_gpus is computed early for ordering logic and then reassigned later
with different semantics; rename the first use to a distinct, descriptive name
(e.g., client_num_gpus_for_order or client_gpus_hint) and update its reference
in the server_goes_first assignment so it only represents the ordering check
(server_goes_first = server_needs_gpus and not client_num_gpus_for_order). Leave
the later variable (client_num_gpus computed from
num_gpus/num_nodes/server_config) as-is or rename it to client_num_gpus_actual
if you prefer clarity, and update any subsequent uses to the new name
(references: client_num_gpus, server_goes_first, num_gpus, server_config,
num_nodes).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/nemo_rl/grpo.py`:
- Around line 423-425: The guard checking dependent_jobs is inconsistent with
its error text and sft.py; update the conditional that reads "if dependent_jobs
> 0:" to use "if dependent_jobs >= 0:" (the block that raises
ValueError("training_data is required when dependent_jobs >= 0")) so the
condition and message match — look for the dependent_jobs / training_data
validation in grpo.py and make this change.

In `@nemo_skills/training/nemo_rl/configs/sft.yaml`:
- Around line 25-29: The config currently hardcodes absolute paths for
model_name and chat_template (see keys model_name and tokenizer.chat_template
and tokenizer.name), which will break outside the original cluster; change these
to environment-variable or placeholder values (e.g., ${MODEL_NAME} /
${CHAT_TEMPLATE} or a relative/default template name) and update any README or
config docs to state they must be overridden; ensure the code that loads this
config respects overrides (note ++policy.model_name is already used) and add a
validation step that raises a clear error if chat_template is unset or missing
so callers must supply/override tokenizer.chat_template before runtime.
- Around line 165-166: The local nemo_skills/training/nemo_rl/configs/sft.yaml
contains cluster-specific hardcoded paths in train_data_path and val_data_path
that are unused by the pipeline; either remove this local config or turn it into
a template: replace the hardcoded paths with placeholder values (e.g.
"<TRAIN_DATA_PATH>" and "<VAL_DATA_PATH>") or add a comment stating it's a
reference and that runtime overrides use Hydra ++data.train.data_path and
++data.validation.data_path; ensure the file header mentions it is not used by
default and can be overridden via --config so future readers don't rely on the
hardcoded paths.

---

Nitpick comments:
In `@dockerfiles/Dockerfile.nemo-rl`:
- Line 7: The Dockerfile currently ends with "USER root", which keeps the
container running as root; replace this by creating and switching to a non-root
user (e.g., create a user/group like "appuser" or "nemo" during the image build
and end with "USER appuser") and ensure any runtime directories or files used by
the process are owned or writable by that user (use chown/chmod during build for
directories the service needs). Verify that any entrypoint or startup scripts
(if referenced) do not require root-only operations or elevate privileges only
when necessary. Ensure the new non-root username matches the final USER
instruction and adjust permissions for any volumes or bind-mounted paths the
container needs to write to.

In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 536-539: The variable client_num_gpus is computed early for
ordering logic and then reassigned later with different semantics; rename the
first use to a distinct, descriptive name (e.g., client_num_gpus_for_order or
client_gpus_hint) and update its reference in the server_goes_first assignment
so it only represents the ordering check (server_goes_first = server_needs_gpus
and not client_num_gpus_for_order). Leave the later variable (client_num_gpus
computed from num_gpus/num_nodes/server_config) as-is or rename it to
client_num_gpus_actual if you prefer clarity, and update any subsequent uses to
the new name (references: client_num_gpus, server_goes_first, num_gpus,
server_config, num_nodes).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 0e189349-d2f2-40e2-a87b-0050328fe955

📥 Commits

Reviewing files that changed from the base of the PR and between 420d006 and 2d5133f.

📒 Files selected for processing (9)
  • dockerfiles/Dockerfile.nemo-rl
  • docs/basics/cluster-configs.md
  • docs/pipelines/training.md
  • nemo_skills/pipeline/nemo_rl/grpo.py
  • nemo_skills/pipeline/nemo_rl/sft.py
  • nemo_skills/pipeline/utils/exp.py
  • nemo_skills/training/nemo_rl/configs/sft.yaml
  • nemo_skills/training/nemo_rl/start_grpo.py
  • nemo_skills/training/nemo_rl/start_sft.py
💤 Files with no reviewable changes (3)
  • docs/pipelines/training.md
  • nemo_skills/training/nemo_rl/start_grpo.py
  • nemo_skills/training/nemo_rl/start_sft.py

Comment thread nemo_skills/pipeline/nemo_rl/grpo.py Outdated
Comment thread nemo_skills/training/nemo_rl/configs/sft.yaml Outdated
Comment thread nemo_skills/training/nemo_rl/configs/sft.yaml Outdated
@wedu-nvidia wedu-nvidia changed the title Wedu/nemo rl update wedu/nemo rl update Apr 17, 2026
@wedu-nvidia wedu-nvidia changed the title wedu/nemo rl update Use upstream NeMo-RL entrypoints and simplify sandbox integration Apr 17, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`:
- Around line 100-104: The template hardcodes node geometry (gpus_per_node=8 and
CPUS_PER_WORKER=112) which breaks Ray actor counting and causes CPU
oversubscription for non-8-GPU nodes; change the template to accept/consume
environment/template variables (e.g., GPUS_PER_NODE and CPUS_PER_WORKER) with
sensible defaults and use those instead of literals, and update
nemo_skills/pipeline/nemo_rl/sft.py so NUM_ACTORS and any cpus-per-task
calculations derive from the passed GPUS_PER_NODE and CPUS_PER_WORKER (or the
existing num_gpus input) rather than assuming 8 GPUs and 112 CPUs, ensuring
NUM_ACTORS math and --cpus-per-task use the parameterized values.
- Around line 289-293: The current get_container_pid() returns the first enroot
PID and can pick the wrong container; change get_container_pid(node,
container_name) to accept a container name pattern and use enroot list -f output
to filter by that name (e.g., match "ray-head" or "ray-worker-.*") before
printing the PID so it reliably returns the PID for the requested container;
update all callers (the sandbox start loop, extract_worker_units, and the final
attach/exec paths that currently call get_container_pid) to pass the appropriate
container name pattern (e.g., "ray-head" for head operations and "ray-worker-*"
for workers) so exec/attach and readiness checks target the correct container.

In `@nemo_skills/pipeline/nemo_rl/sft.py`:
- Around line 394-396: The code currently only remaps absolute config paths
outside UPSTREAM_NEMO_RL_ROOT, allowing repo-relative --config values
(training_config) to be passed through and potentially fail at runtime; update
the logic in the training_config handling to either (a) normalize repo-relative
paths by resolving them into the mounted repo location
(/nemo_run/code/<relative>) using get_mounted_path(cluster_config, ...) when
training_config is not absolute and does not equal UPSTREAM_SFT_CONFIG, or (b)
explicitly reject non-upstream relative inputs with a clear error message
(raise/exit) so callers of --config get immediate feedback; modify the block
referencing training_config, UPSTREAM_SFT_CONFIG, UPSTREAM_NEMO_RL_ROOT and
get_mounted_path to implement one of these behaviors and ensure cluster_config
is used to derive the mounted repo path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b6ed4524-6203-4b32-8cb1-f57723936387

📥 Commits

Reviewing files that changed from the base of the PR and between fa59f66 and 66a6480.

📒 Files selected for processing (4)
  • nemo_skills/pipeline/nemo_rl/grpo.py
  • nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2
  • nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2
  • nemo_skills/pipeline/nemo_rl/sft.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/pipeline/nemo_rl/grpo.py

Comment on lines +100 to +104
# Number of GPUs per node
gpus_per_node=8

# Number of CPUs per worker node
CPUS_PER_WORKER=${CPUS_PER_WORKER:-112}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Parameterize GPU/CPU geometry instead of baking in one node shape.

This template locks Ray accounting to gpus_per_node=8 and CPUS_PER_WORKER=112, while nemo_skills/pipeline/nemo_rl/sft.py still accepts arbitrary num_gpus. Non-8-GPU jobs will miscount NUM_ACTORS, and smaller nodes can be oversubscribed immediately by --cpus-per-task=$CPUS_PER_WORKER.

Also applies to: 343-343

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`
around lines 100 - 104, The template hardcodes node geometry (gpus_per_node=8
and CPUS_PER_WORKER=112) which breaks Ray actor counting and causes CPU
oversubscription for non-8-GPU nodes; change the template to accept/consume
environment/template variables (e.g., GPUS_PER_NODE and CPUS_PER_WORKER) with
sensible defaults and use those instead of literals, and update
nemo_skills/pipeline/nemo_rl/sft.py so NUM_ACTORS and any cpus-per-task
calculations derive from the passed GPUS_PER_NODE and CPUS_PER_WORKER (or the
existing num_gpus input) rather than assuming 8 GPUs and 112 CPUs, ensuring
NUM_ACTORS math and --cpus-per-task use the parameterized values.

Comment on lines +289 to +293
# Helper function to get container PID using enroot (workaround for --overlap --container-name bug)
get_container_pid() {
local node=$1
srun --overlap --nodes=1 -w "$node" bash -c "enroot list -f | awk 'NR>1 && \$2 ~ /^[0-9]+\$/ {print \$2; exit}'"
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

git ls-files "*nemo_skills_sandbox_ray_enroot*"

Repository: NVIDIA-NeMo/Skills

Length of output: 142


🏁 Script executed:

cat -n nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2 | head -100

Repository: NVIDIA-NeMo/Skills

Length of output: 4686


🏁 Script executed:

sed -n '280,350p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 2857


🏁 Script executed:

sed -n '315,345p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 1327


🏁 Script executed:

sed -n '345,450p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 3739


🏁 Script executed:

sed -n '420,440p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 805


🏁 Script executed:

sed -n '500,510p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 722


🏁 Script executed:

sed -n '495,530p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 2064


🏁 Script executed:

grep -n "container-name" nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 605


🏁 Script executed:

sed -n '280,295p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 583


🏁 Script executed:

sed -n '315,345p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 1327


🏁 Script executed:

sed -n '315,340p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 1245


🏁 Script executed:

enroot list --help 2>&1 | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 106


🏁 Script executed:

rg "enroot list" nemo_skills/pipeline/nemo_rl/ray_templates/

Repository: NVIDIA-NeMo/Skills

Length of output: 482


🌐 Web query:

enroot container name lookup query PID

💡 Result:

To look up the Enroot container name for a given PID, use the enroot list -f command, which scans all potential Enroot processes and displays their associated container names, PIDs, and other details. Example output: NAME PID COMM STATE STARTED TIME MNTNS USERNS COMMAND ubuntu 1583996 S 05:08 01:51:13 4026532603 4026532602 /bin/sh /root/test/1597122491.2123 This lists running Enroot containers with their PIDs. To find a specific PID, grep the output: enroot list -f | grep Internally, Enroot implements this in src/runtime.sh runtime::list function: - Lists directories in $ENROOT_DATA_PATH (default ~/.local/share/enroot/, container rootfs dirs). - Uses lsns -n -r -t mnt -o pid to get PIDs with mount namespaces. - For each PID, checks if /proc/${pid}/root/.lock exists (Enroot lock file indicating container root). - Parses /proc/${pid}/mountinfo to extract the root mount source path. - Strips the $ENROOT_DATA_PATH prefix to get the container name (basename of rootfs dir). - If no match, shows "". - Then runs ps on matching PIDs for details. To manually query for a specific PID without listing all: 1. Check if /proc/PID/root/.lock exists (confirms Enroot container). 2. Extract root mount from /proc/PID/mountinfo: find line where mount_point is "/" , take the "root" field (4th field), which is the path relative to filesystem, typically the full host path to rootfs like /home/user/.local/share/enroot/ubuntu. 3. Container name is basename of that path (e.g., "ubuntu"). Example awk one-liner for mountinfo root: awk '$5 == "/" {for(i=4; i<NF && $i != "-"; i++) printf "%s ", $i; print ""}' /proc/PID/mountinfo | xargs dirname | xargs basename This derives the name from the filesystem path without Enroot.

Citations:


🏁 Script executed:

sed -n '295,310p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2

Repository: NVIDIA-NeMo/Skills

Length of output: 802


Use the Ray container name to look up the correct container PID, not the first enroot PID.

get_container_pid() returns the first container PID on a node from enroot list. When the sandbox container starts on all nodes (lines 317-338) without a --container-name, multiple containers exist on the head node. Since the first PID from enroot list output is non-deterministic, get_container_pid() may select the sandbox container instead of the ray-head container. This causes extract_worker_units() (line 423), the final driver command (line 502), and the attach flow (lines 501-504) to exec into the wrong container, making readiness checks and job execution non-deterministic.

Modify get_container_pid() to filter by container name (e.g., ray-head or ray-worker-*) rather than returning the first result.

Also applies to: 317-338, 423-432, 501-504

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`
around lines 289 - 293, The current get_container_pid() returns the first enroot
PID and can pick the wrong container; change get_container_pid(node,
container_name) to accept a container name pattern and use enroot list -f output
to filter by that name (e.g., match "ray-head" or "ray-worker-.*") before
printing the PID so it reliably returns the PID for the requested container;
update all callers (the sandbox start loop, extract_worker_units, and the final
attach/exec paths that currently call get_container_pid) to pass the appropriate
container name pattern (e.g., "ray-head" for head operations and "ray-worker-*"
for workers) so exec/attach and readiness checks target the correct container.

Comment on lines +394 to +396
training_config = config or UPSTREAM_SFT_CONFIG
if training_config.startswith("/") and not training_config.startswith(UPSTREAM_NEMO_RL_ROOT):
training_config = get_mounted_path(cluster_config, training_config)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Normalize or reject repo-relative --config paths.

Only absolute configs outside /opt/nemo-rl are remapped here. A relative override is passed through unchanged, but this PR’s Ray launch templates run COMMAND from COMMAND_WORKDIR (the submit directory), not /nemo_run/code, so --config some/local.yaml can fail even though the new flag advertises arbitrary YAML overrides. Either rewrite repo-local paths to /nemo_run/code/... or fail fast on non-upstream relative inputs.

Suggested fix
     training_config = config or UPSTREAM_SFT_CONFIG
-    if training_config.startswith("/") and not training_config.startswith(UPSTREAM_NEMO_RL_ROOT):
-        training_config = get_mounted_path(cluster_config, training_config)
+    if training_config.startswith("/"):
+        if not training_config.startswith(UPSTREAM_NEMO_RL_ROOT):
+            training_config = get_mounted_path(cluster_config, training_config)
+    elif config is not None:
+        training_config = f"/nemo_run/code/{training_config}"

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/nemo_rl/sft.py` around lines 394 - 396, The code
currently only remaps absolute config paths outside UPSTREAM_NEMO_RL_ROOT,
allowing repo-relative --config values (training_config) to be passed through
and potentially fail at runtime; update the logic in the training_config
handling to either (a) normalize repo-relative paths by resolving them into the
mounted repo location (/nemo_run/code/<relative>) using
get_mounted_path(cluster_config, ...) when training_config is not absolute and
does not equal UPSTREAM_SFT_CONFIG, or (b) explicitly reject non-upstream
relative inputs with a clear error message (raise/exit) so callers of --config
get immediate feedback; modify the block referencing training_config,
UPSTREAM_SFT_CONFIG, UPSTREAM_NEMO_RL_ROOT and get_mounted_path to implement one
of these behaviors and ensure cluster_config is used to derive the mounted repo
path.

@wedu-nvidia
Copy link
Copy Markdown
Collaborator Author

I updated the code to support local ray template for sandbox

Comment thread dockerfiles/Dockerfile.nemo-rl Outdated
# NOTICES.txt file points to where the OSS source code is archived
RUN echo "This distribution includes open source which is archived at the following URL: https://opensource.nvidia.com/oss/teams/nvidia/nemo-rl/${RC_DATE}:linux-${TARGETARCH}/index.html" > NOTICES.txt && \
echo "For further inquiries or assistance, contact us at oss-requests@nvidia.com" >> NOTICES.txt
WORKDIR /opt/NeMo-Skills
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need nemo skills to be installed? Ideally we just use the corresponding nightly in cluster configs and not even have this dockerfile

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do not need it, I put it in case later we might need nemo-skills for this.
I think we can remove the dockerfile completely now

Comment thread nemo_skills/pipeline/nemo_rl/grpo.py Outdated
sandbox_container: str = typer.Option(None, help="Override container image for the sandbox sidecar"),
ray_template: str = typer.Option(
None,
help="Override the Ray template for training tasks. Can be a template name or a local .j2 path. Only applies to Ray-backed jobs; local executor ignores it.",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help="Override the Ray template for training tasks. Can be a template name or a local .j2 path. Only applies to Ray-backed jobs; local executor ignores it.",
help="Override the Ray template for training tasks. Can be a template name or a local .j2 path. Only applies to slurm-based jobs; local executor ignores it.",

Comment thread nemo_skills/pipeline/nemo_rl/grpo.py Outdated
validation_data = get_mounted_path(cluster_config, validation_data)
if run_conversion_only:
dependent_jobs = -1
if dependent_jobs > 0:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if dependent_jobs > 0:
if dependent_jobs >= 0:

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

@@ -2,118 +2,123 @@
sft:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not available in nemo-rl directly?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nano-v3 yaml has not merged, and I can put it for reference, I can remove it now or after nemo-rl pr merged.

Comment thread nemo_skills/pipeline/utils/exp.py Outdated
het_group = 0
het_group_indices = []
total_het_groups = (n_servers if server_config is not None else 0) + bool(cmd) + with_sandbox
het_group_num_nodes = {}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not fully sure i understand this change, @gwarmstrong could you also have a look?

@wedu-nvidia
Copy link
Copy Markdown
Collaborator Author

wedu-nvidia commented Apr 21, 2026

@Kipok
I removed below

  • the local Dockerfile.nemo-rl
  • the local nemo_rl/configs/sft.yaml
  • SFT-specific template override exposure

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2 (1)

289-293: ⚠️ Potential issue | 🟠 Major

get_container_pid() may return wrong container PID when sandbox is running.

This function returns the first container PID from enroot list, but when the sandbox container starts (lines 316-341), multiple containers exist on the same node. The function may return the sandbox container's PID instead of ray-head, causing ray status checks and job execution to fail or behave non-deterministically.

Consider filtering by container name pattern. For example, enroot containers often include the image name in their identifier.

#!/bin/bash
# Check how enroot list -f output looks and if container names are distinguishable
# This helps understand if filtering by name is feasible

# Look for enroot-related documentation or usage in the codebase
rg -n "enroot list" --type-add 'jinja:*.j2' --type jinja
rg -n "container-name" nemo_skills/pipeline/nemo_rl/ray_templates/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`
around lines 289 - 293, get_container_pid() currently returns the first PID from
enroot list which can pick the sandbox container; change it to select the PID
for the ray-head container by filtering the enroot list output by the container
name or image pattern (look for "ray-head" or the Ray image tag used when
launching the head) before extracting the PID, i.e. modify the logic in
get_container_pid to match container identifiers by name/pattern rather than
taking the first numeric PID so subsequent ray status checks and job runs target
the correct container.
🧹 Nitpick comments (3)
nemo_skills/pipeline/utils/exp.py (1)

611-611: Variable shadowing: client_num_gpus is redefined.

client_num_gpus is first defined at line 537, then redefined at line 611 inside the loop. While the logic appears correct (the inner one uses different conditions), this shadowing can cause confusion during maintenance. Consider using a different name like main_task_gpus for the inner variable.

♻️ Rename to avoid shadowing
-            client_num_gpus = num_gpus if (server_config is None or num_nodes > 1) else 0
+            main_task_gpus = num_gpus if (server_config is None or num_nodes > 1) else 0
             main_env_updates = {"NEMO_SKILLS_SANDBOX_PORT": sandbox_port}
             if with_ray:
-                main_env_updates["GPUS_PER_NODE"] = str(client_num_gpus)
+                main_env_updates["GPUS_PER_NODE"] = str(main_task_gpus)

And update the executor call at line 632:

-                        gpus_per_node=client_num_gpus,
+                        gpus_per_node=main_task_gpus,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/exp.py` at line 611, The inner redefinition of
client_num_gpus inside the loop shadows the earlier variable; rename the inner
variable (e.g., to main_task_gpus or task_client_gpus) where it is assigned with
the expression "num_gpus if (server_config is None or num_nodes > 1) else 0" and
update any subsequent uses (notably the executor call that currently references
client_num_gpus) to use the new name so the outer client_num_gpus remains
unchanged and the intent is clear.
nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2 (1)

411-413: Dead code: condition is never true.

Same issue as the other template. The loop starts at i = 1, so $i -eq 0 is never satisfied.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`
around lines 411 - 413, The conditional checking if [[ $i -eq 0 ]] inside the
loop is dead because the loop starts at i=1; either remove the unreachable
branch or change the check to the correct index (e.g., [[ $i -eq 1 ]]) depending
on intended behavior; update the template snippet that sets
OVERLAP_HEAD_AND_WORKER_ARG (the block containing
OVERLAP_HEAD_AND_WORKER_ARG="--overlap") so it executes for the actual
first-iteration index or is removed if not needed.
nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2 (1)

403-405: Dead code: condition is never true.

The loop starts at i = 1, so the condition $i -eq 0 can never be true. The OVERLAP_HEAD_AND_WORKER_ARG variable is never set. If workers don't need --overlap, this block can be removed. If overlap was intended for worker 0 (which doesn't exist in this loop since the head is on node 0), the logic needs rethinking.

🧹 Remove dead code
   srun {{ common_srun_args }} ${OVERLAP_HEAD_AND_WORKER_ARG:-} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$CPUS_PER_WORKER -w "$node_i" -o $LOG_DIR/{{ ray_log_prefix }}worker-$i.log bash -x -c "$worker_cmd" &
+  srun {{ common_srun_args }} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$CPUS_PER_WORKER -w "$node_i" -o $LOG_DIR/{{ ray_log_prefix }}worker-$i.log bash -x -c "$worker_cmd" &
-  if [[ $i -eq 0 ]]; then
-    OVERLAP_HEAD_AND_WORKER_ARG="--overlap"
-  fi
-  srun {{ common_srun_args }} ${OVERLAP_HEAD_AND_WORKER_ARG:-} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$CPUS_PER_WORKER -w "$node_i" -o $LOG_DIR/{{ ray_log_prefix }}worker-$i.log bash -x -c "$worker_cmd" &
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2`
around lines 403 - 405, The conditional block that checks "if [[ $i -eq 0 ]]"
inside the worker loop is dead because the loop starts at i=1, so
OVERLAP_HEAD_AND_WORKER_ARG is never set; remove the entire if-block (the lines
setting OVERLAP_HEAD_AND_WORKER_ARG and its fi) from
nemo_skills_sandbox_ray.sub.j2, or if the intent was to enable overlap for the
head, move/convert that logic to the head-creation branch (where node 0/head is
created) instead of inside the worker loop; reference symbols: the loop variable
i and OVERLAP_HEAD_AND_WORKER_ARG.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`:
- Around line 289-293: get_container_pid() currently returns the first PID from
enroot list which can pick the sandbox container; change it to select the PID
for the ray-head container by filtering the enroot list output by the container
name or image pattern (look for "ray-head" or the Ray image tag used when
launching the head) before extracting the PID, i.e. modify the logic in
get_container_pid to match container identifiers by name/pattern rather than
taking the first numeric PID so subsequent ray status checks and job runs target
the correct container.

---

Nitpick comments:
In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`:
- Around line 411-413: The conditional checking if [[ $i -eq 0 ]] inside the
loop is dead because the loop starts at i=1; either remove the unreachable
branch or change the check to the correct index (e.g., [[ $i -eq 1 ]]) depending
on intended behavior; update the template snippet that sets
OVERLAP_HEAD_AND_WORKER_ARG (the block containing
OVERLAP_HEAD_AND_WORKER_ARG="--overlap") so it executes for the actual
first-iteration index or is removed if not needed.

In `@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2`:
- Around line 403-405: The conditional block that checks "if [[ $i -eq 0 ]]"
inside the worker loop is dead because the loop starts at i=1, so
OVERLAP_HEAD_AND_WORKER_ARG is never set; remove the entire if-block (the lines
setting OVERLAP_HEAD_AND_WORKER_ARG and its fi) from
nemo_skills_sandbox_ray.sub.j2, or if the intent was to enable overlap for the
head, move/convert that logic to the head-creation branch (where node 0/head is
created) instead of inside the worker loop; reference symbols: the loop variable
i and OVERLAP_HEAD_AND_WORKER_ARG.

In `@nemo_skills/pipeline/utils/exp.py`:
- Line 611: The inner redefinition of client_num_gpus inside the loop shadows
the earlier variable; rename the inner variable (e.g., to main_task_gpus or
task_client_gpus) where it is assigned with the expression "num_gpus if
(server_config is None or num_nodes > 1) else 0" and update any subsequent uses
(notably the executor call that currently references client_num_gpus) to use the
new name so the outer client_num_gpus remains unchanged and the intent is clear.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5b927823-65af-48cc-989f-8cdd66446574

📥 Commits

Reviewing files that changed from the base of the PR and between 66a6480 and 76ad2be.

📒 Files selected for processing (9)
  • cluster_configs/example-local.yaml
  • dockerfiles/Dockerfile.nemo-rl
  • nemo_skills/__init__.py
  • nemo_skills/pipeline/nemo_rl/grpo.py
  • nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2
  • nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2
  • nemo_skills/pipeline/utils/exp.py
  • nemo_skills/training/nemo_rl/configs/sft.yaml
  • tests/gpu-tests/test-local.yaml
💤 Files with no reviewable changes (2)
  • dockerfiles/Dockerfile.nemo-rl
  • nemo_skills/training/nemo_rl/configs/sft.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/pipeline/nemo_rl/grpo.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 611-623: main_env_updates currently sets "SANDBOX_PORT" to
sandbox_port which may be an integer; make it consistently a string like
"GPUS_PER_NODE" by converting sandbox_port to str when updating main_env_updates
(update the block that builds main_env_updates before calling
temporary_env_update; reference variables/symbols main_env_updates,
SANDBOX_PORT, sandbox_port, temporary_env_update, and cluster_config to locate
the change).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a959d48a-d304-44e6-9d54-3fa7d8ba200b

📥 Commits

Reviewing files that changed from the base of the PR and between 76ad2be and 0feb908.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/utils/exp.py

Comment thread nemo_skills/pipeline/utils/exp.py
Kipok
Kipok previously approved these changes Apr 23, 2026
Copy link
Copy Markdown
Collaborator

@Kipok Kipok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, as long as tests pass. Did we launch slurm test (the omr one) here?

Comment thread nemo_skills/pipeline/nemo_rl/grpo.py Outdated
help="If specified, will reuse the code from this experiment. "
"Can provide an experiment name or an experiment object if running from code.",
),
config: str = typer.Option(None, help="Override training config YAML; defaults to the upstream container config"),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
config: str = typer.Option(None, help="Override training config YAML; defaults to the upstream container config"),
config: str = typer.Option(None, help="Override training config YAML; defaults to the upstream container config for nano-v3"),

@Kipok Kipok dismissed their stale review April 23, 2026 00:38

might need more changes

nemo-skills: dockerfile:dockerfiles/Dockerfile.nemo-skills
verl: dockerfile:dockerfiles/Dockerfile.verl
nemo-rl: dockerfile:dockerfiles/Dockerfile.nemo-rl
nemo-rl: nvcr.io/nvidian/nemo-rl:733c7d6-48669320
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this internal container? We ideally want something publicly available here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for nightly container published internally, but not sure if public ones are up to date, I will have a check

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kipok I contacted the team and was told public container is released for every 2 months roughly, it seems too old then?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's ok to keep it as private as it has a commit in the name. Can you update the readme in dockerfiles folder to mention that it defaults to private container accessible to nvidia employees, but for anyone else they can build it from corresponding dockerfile / commit in nemo-rl repo? But one thing we should make sure is that our ci still works, so can you check what key / setting we need to set there and sync with me offline so that I can update our repo settings and we can test that gpu / gitlab ci passes

@@ -0,0 +1,502 @@
#!/bin/bash
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need two templates? If one was for lax cluster, we can probably not port it as it's not needed anymore

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for lax, I keep it here in case one day we need for lax types. But I can delete it now

@wedu-nvidia
Copy link
Copy Markdown
Collaborator Author

lgtm, as long as tests pass. Did we launch slurm test (the omr one) here?

not yet, I can test for this.

@wedu-nvidia wedu-nvidia force-pushed the wedu/nemo-rl-update branch from 894cee3 to 1a725b3 Compare April 28, 2026 16:51
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
@wedu-nvidia wedu-nvidia force-pushed the wedu/nemo-rl-update branch from 1a725b3 to 0ad7b49 Compare April 28, 2026 16:53
@wedu-nvidia
Copy link
Copy Markdown
Collaborator Author

lgtm, as long as tests pass. Did we launch slurm test (the omr one) here?

sorry for late reply here, it seems omr cannot pass. I checked the sft loss curve which looks good to me, but eval results is bad, I will investigate this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants