Use upstream NeMo-RL entrypoints and simplify sandbox integration#1381
Use upstream NeMo-RL entrypoints and simplify sandbox integration#1381wedu-nvidia wants to merge 9 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRemoved the local NeMo‑RL Dockerfile and local SFT/GRPO entrypoints; pinned and use an upstream NeMo‑RL container image; updated pipelines to call upstream entrypoints with new CLI options and config mounting; added Ray sandbox SLURM templates and adjusted SLURM/Ray sandbox orchestration and SFT config references. Changes
Sequence Diagram(s)sequenceDiagram
rect rgba(30,144,255,0.5)
participant CLI as Pipeline CLI
participant Orchestrator as exp.add_task / Scheduler
participant SLURM as SLURM Job / srun
participant RayTemplate as Ray Sandbox Template
participant Container as Upstream NeMo‑RL Container
end
CLI->>Orchestrator: submit job (opts: config, container, sandbox, ray_template)
Orchestrator->>SLURM: compute het groups, node counts, sandbox strategy
alt with Ray
Orchestrator->>SLURM: inject SANDBOX_* env into main task
SLURM->>RayTemplate: run Ray head/workers + optional sandbox
RayTemplate->>Container: start head/workers, run SANDBOX_COMMAND inside container
else without Ray
Orchestrator->>SLURM: create per-het-group executors and sandbox sidecars
SLURM->>Container: run executors and sandbox sidecars via srun
end
SLURM->>Container: execute upstream entrypoint (run_sft/run_grpo) with --config and mounted paths
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
dockerfiles/Dockerfile.nemo-rl (1)
7-7: Consider adding a non-root user for container runtime.Trivy flagged that the final USER in this Dockerfile is
root. While this may be intentional for training workloads requiring elevated permissions, running as non-root is a security best practice. If root access isn't strictly required at runtime, consider adding a non-root user.💡 Optional fix
RUN uv pip install /opt/NeMo-Skills + +# Optional: Run as non-root user at runtime +RUN useradd -m -u 1000 nemorl +USER nemorl🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dockerfiles/Dockerfile.nemo-rl` at line 7, The Dockerfile currently ends with "USER root", which keeps the container running as root; replace this by creating and switching to a non-root user (e.g., create a user/group like "appuser" or "nemo" during the image build and end with "USER appuser") and ensure any runtime directories or files used by the process are owned or writable by that user (use chown/chmod during build for directories the service needs). Verify that any entrypoint or startup scripts (if referenced) do not require root-only operations or elevate privileges only when necessary. Ensure the new non-root username matches the final USER instruction and adjust permissions for any volumes or bind-mounted paths the container needs to write to.nemo_skills/pipeline/utils/exp.py (1)
536-539: Variableclient_num_gpusis reassigned later in the function.
client_num_gpusis computed at line 536 for theserver_goes_firstlogic, but then reassigned at line 638 with different logic (num_gpus if (server_config is None or num_nodes > 1) else 0). This reassignment appears intentional (different contexts), but the same variable name could cause confusion. Consider renaming one to clarify the distinct purposes.Also applies to: 638-638
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/exp.py` around lines 536 - 539, The variable client_num_gpus is computed early for ordering logic and then reassigned later with different semantics; rename the first use to a distinct, descriptive name (e.g., client_num_gpus_for_order or client_gpus_hint) and update its reference in the server_goes_first assignment so it only represents the ordering check (server_goes_first = server_needs_gpus and not client_num_gpus_for_order). Leave the later variable (client_num_gpus computed from num_gpus/num_nodes/server_config) as-is or rename it to client_num_gpus_actual if you prefer clarity, and update any subsequent uses to the new name (references: client_num_gpus, server_goes_first, num_gpus, server_config, num_nodes).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/nemo_rl/grpo.py`:
- Around line 423-425: The guard checking dependent_jobs is inconsistent with
its error text and sft.py; update the conditional that reads "if dependent_jobs
> 0:" to use "if dependent_jobs >= 0:" (the block that raises
ValueError("training_data is required when dependent_jobs >= 0")) so the
condition and message match — look for the dependent_jobs / training_data
validation in grpo.py and make this change.
In `@nemo_skills/training/nemo_rl/configs/sft.yaml`:
- Around line 25-29: The config currently hardcodes absolute paths for
model_name and chat_template (see keys model_name and tokenizer.chat_template
and tokenizer.name), which will break outside the original cluster; change these
to environment-variable or placeholder values (e.g., ${MODEL_NAME} /
${CHAT_TEMPLATE} or a relative/default template name) and update any README or
config docs to state they must be overridden; ensure the code that loads this
config respects overrides (note ++policy.model_name is already used) and add a
validation step that raises a clear error if chat_template is unset or missing
so callers must supply/override tokenizer.chat_template before runtime.
- Around line 165-166: The local nemo_skills/training/nemo_rl/configs/sft.yaml
contains cluster-specific hardcoded paths in train_data_path and val_data_path
that are unused by the pipeline; either remove this local config or turn it into
a template: replace the hardcoded paths with placeholder values (e.g.
"<TRAIN_DATA_PATH>" and "<VAL_DATA_PATH>") or add a comment stating it's a
reference and that runtime overrides use Hydra ++data.train.data_path and
++data.validation.data_path; ensure the file header mentions it is not used by
default and can be overridden via --config so future readers don't rely on the
hardcoded paths.
---
Nitpick comments:
In `@dockerfiles/Dockerfile.nemo-rl`:
- Line 7: The Dockerfile currently ends with "USER root", which keeps the
container running as root; replace this by creating and switching to a non-root
user (e.g., create a user/group like "appuser" or "nemo" during the image build
and end with "USER appuser") and ensure any runtime directories or files used by
the process are owned or writable by that user (use chown/chmod during build for
directories the service needs). Verify that any entrypoint or startup scripts
(if referenced) do not require root-only operations or elevate privileges only
when necessary. Ensure the new non-root username matches the final USER
instruction and adjust permissions for any volumes or bind-mounted paths the
container needs to write to.
In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 536-539: The variable client_num_gpus is computed early for
ordering logic and then reassigned later with different semantics; rename the
first use to a distinct, descriptive name (e.g., client_num_gpus_for_order or
client_gpus_hint) and update its reference in the server_goes_first assignment
so it only represents the ordering check (server_goes_first = server_needs_gpus
and not client_num_gpus_for_order). Leave the later variable (client_num_gpus
computed from num_gpus/num_nodes/server_config) as-is or rename it to
client_num_gpus_actual if you prefer clarity, and update any subsequent uses to
the new name (references: client_num_gpus, server_goes_first, num_gpus,
server_config, num_nodes).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0e189349-d2f2-40e2-a87b-0050328fe955
📒 Files selected for processing (9)
dockerfiles/Dockerfile.nemo-rldocs/basics/cluster-configs.mddocs/pipelines/training.mdnemo_skills/pipeline/nemo_rl/grpo.pynemo_skills/pipeline/nemo_rl/sft.pynemo_skills/pipeline/utils/exp.pynemo_skills/training/nemo_rl/configs/sft.yamlnemo_skills/training/nemo_rl/start_grpo.pynemo_skills/training/nemo_rl/start_sft.py
💤 Files with no reviewable changes (3)
- docs/pipelines/training.md
- nemo_skills/training/nemo_rl/start_grpo.py
- nemo_skills/training/nemo_rl/start_sft.py
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`:
- Around line 100-104: The template hardcodes node geometry (gpus_per_node=8 and
CPUS_PER_WORKER=112) which breaks Ray actor counting and causes CPU
oversubscription for non-8-GPU nodes; change the template to accept/consume
environment/template variables (e.g., GPUS_PER_NODE and CPUS_PER_WORKER) with
sensible defaults and use those instead of literals, and update
nemo_skills/pipeline/nemo_rl/sft.py so NUM_ACTORS and any cpus-per-task
calculations derive from the passed GPUS_PER_NODE and CPUS_PER_WORKER (or the
existing num_gpus input) rather than assuming 8 GPUs and 112 CPUs, ensuring
NUM_ACTORS math and --cpus-per-task use the parameterized values.
- Around line 289-293: The current get_container_pid() returns the first enroot
PID and can pick the wrong container; change get_container_pid(node,
container_name) to accept a container name pattern and use enroot list -f output
to filter by that name (e.g., match "ray-head" or "ray-worker-.*") before
printing the PID so it reliably returns the PID for the requested container;
update all callers (the sandbox start loop, extract_worker_units, and the final
attach/exec paths that currently call get_container_pid) to pass the appropriate
container name pattern (e.g., "ray-head" for head operations and "ray-worker-*"
for workers) so exec/attach and readiness checks target the correct container.
In `@nemo_skills/pipeline/nemo_rl/sft.py`:
- Around line 394-396: The code currently only remaps absolute config paths
outside UPSTREAM_NEMO_RL_ROOT, allowing repo-relative --config values
(training_config) to be passed through and potentially fail at runtime; update
the logic in the training_config handling to either (a) normalize repo-relative
paths by resolving them into the mounted repo location
(/nemo_run/code/<relative>) using get_mounted_path(cluster_config, ...) when
training_config is not absolute and does not equal UPSTREAM_SFT_CONFIG, or (b)
explicitly reject non-upstream relative inputs with a clear error message
(raise/exit) so callers of --config get immediate feedback; modify the block
referencing training_config, UPSTREAM_SFT_CONFIG, UPSTREAM_NEMO_RL_ROOT and
get_mounted_path to implement one of these behaviors and ensure cluster_config
is used to derive the mounted repo path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b6ed4524-6203-4b32-8cb1-f57723936387
📒 Files selected for processing (4)
nemo_skills/pipeline/nemo_rl/grpo.pynemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2nemo_skills/pipeline/nemo_rl/sft.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/pipeline/nemo_rl/grpo.py
| # Number of GPUs per node | ||
| gpus_per_node=8 | ||
|
|
||
| # Number of CPUs per worker node | ||
| CPUS_PER_WORKER=${CPUS_PER_WORKER:-112} |
There was a problem hiding this comment.
Parameterize GPU/CPU geometry instead of baking in one node shape.
This template locks Ray accounting to gpus_per_node=8 and CPUS_PER_WORKER=112, while nemo_skills/pipeline/nemo_rl/sft.py still accepts arbitrary num_gpus. Non-8-GPU jobs will miscount NUM_ACTORS, and smaller nodes can be oversubscribed immediately by --cpus-per-task=$CPUS_PER_WORKER.
Also applies to: 343-343
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`
around lines 100 - 104, The template hardcodes node geometry (gpus_per_node=8
and CPUS_PER_WORKER=112) which breaks Ray actor counting and causes CPU
oversubscription for non-8-GPU nodes; change the template to accept/consume
environment/template variables (e.g., GPUS_PER_NODE and CPUS_PER_WORKER) with
sensible defaults and use those instead of literals, and update
nemo_skills/pipeline/nemo_rl/sft.py so NUM_ACTORS and any cpus-per-task
calculations derive from the passed GPUS_PER_NODE and CPUS_PER_WORKER (or the
existing num_gpus input) rather than assuming 8 GPUs and 112 CPUs, ensuring
NUM_ACTORS math and --cpus-per-task use the parameterized values.
| # Helper function to get container PID using enroot (workaround for --overlap --container-name bug) | ||
| get_container_pid() { | ||
| local node=$1 | ||
| srun --overlap --nodes=1 -w "$node" bash -c "enroot list -f | awk 'NR>1 && \$2 ~ /^[0-9]+\$/ {print \$2; exit}'" | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
git ls-files "*nemo_skills_sandbox_ray_enroot*"Repository: NVIDIA-NeMo/Skills
Length of output: 142
🏁 Script executed:
cat -n nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2 | head -100Repository: NVIDIA-NeMo/Skills
Length of output: 4686
🏁 Script executed:
sed -n '280,350p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 2857
🏁 Script executed:
sed -n '315,345p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 1327
🏁 Script executed:
sed -n '345,450p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 3739
🏁 Script executed:
sed -n '420,440p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 805
🏁 Script executed:
sed -n '500,510p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 722
🏁 Script executed:
sed -n '495,530p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 2064
🏁 Script executed:
grep -n "container-name" nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 605
🏁 Script executed:
sed -n '280,295p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 583
🏁 Script executed:
sed -n '315,345p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 1327
🏁 Script executed:
sed -n '315,340p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 1245
🏁 Script executed:
enroot list --help 2>&1 | head -20Repository: NVIDIA-NeMo/Skills
Length of output: 106
🏁 Script executed:
rg "enroot list" nemo_skills/pipeline/nemo_rl/ray_templates/Repository: NVIDIA-NeMo/Skills
Length of output: 482
🌐 Web query:
enroot container name lookup query PID
💡 Result:
To look up the Enroot container name for a given PID, use the enroot list -f command, which scans all potential Enroot processes and displays their associated container names, PIDs, and other details. Example output: NAME PID COMM STATE STARTED TIME MNTNS USERNS COMMAND ubuntu 1583996 S 05:08 01:51:13 4026532603 4026532602 /bin/sh /root/test/1597122491.2123 This lists running Enroot containers with their PIDs. To find a specific PID, grep the output: enroot list -f | grep Internally, Enroot implements this in src/runtime.sh runtime::list function: - Lists directories in
Citations:
- 1: is there any way to name enroot container so that the container could be filtered by enroot list -f easier? NVIDIA/enroot#33
- 2: enroot list -f behaviour on fedora NVIDIA/enroot#84
- 3: "enroot list -f" can fail if a container PID disappears NVIDIA/enroot#126
- 4: https://github.com/NVIDIA/enroot/blob/main/src/runtime.sh
- 5: https://github.com/NVIDIA/enroot/blob/main/enroot.in
- 6: https://www.man7.org/linux/man-pages/man5/proc_pid_mountinfo.5.html
🏁 Script executed:
sed -n '295,310p' nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2Repository: NVIDIA-NeMo/Skills
Length of output: 802
Use the Ray container name to look up the correct container PID, not the first enroot PID.
get_container_pid() returns the first container PID on a node from enroot list. When the sandbox container starts on all nodes (lines 317-338) without a --container-name, multiple containers exist on the head node. Since the first PID from enroot list output is non-deterministic, get_container_pid() may select the sandbox container instead of the ray-head container. This causes extract_worker_units() (line 423), the final driver command (line 502), and the attach flow (lines 501-504) to exec into the wrong container, making readiness checks and job execution non-deterministic.
Modify get_container_pid() to filter by container name (e.g., ray-head or ray-worker-*) rather than returning the first result.
Also applies to: 317-338, 423-432, 501-504
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`
around lines 289 - 293, The current get_container_pid() returns the first enroot
PID and can pick the wrong container; change get_container_pid(node,
container_name) to accept a container name pattern and use enroot list -f output
to filter by that name (e.g., match "ray-head" or "ray-worker-.*") before
printing the PID so it reliably returns the PID for the requested container;
update all callers (the sandbox start loop, extract_worker_units, and the final
attach/exec paths that currently call get_container_pid) to pass the appropriate
container name pattern (e.g., "ray-head" for head operations and "ray-worker-*"
for workers) so exec/attach and readiness checks target the correct container.
| training_config = config or UPSTREAM_SFT_CONFIG | ||
| if training_config.startswith("/") and not training_config.startswith(UPSTREAM_NEMO_RL_ROOT): | ||
| training_config = get_mounted_path(cluster_config, training_config) |
There was a problem hiding this comment.
Normalize or reject repo-relative --config paths.
Only absolute configs outside /opt/nemo-rl are remapped here. A relative override is passed through unchanged, but this PR’s Ray launch templates run COMMAND from COMMAND_WORKDIR (the submit directory), not /nemo_run/code, so --config some/local.yaml can fail even though the new flag advertises arbitrary YAML overrides. Either rewrite repo-local paths to /nemo_run/code/... or fail fast on non-upstream relative inputs.
Suggested fix
training_config = config or UPSTREAM_SFT_CONFIG
- if training_config.startswith("/") and not training_config.startswith(UPSTREAM_NEMO_RL_ROOT):
- training_config = get_mounted_path(cluster_config, training_config)
+ if training_config.startswith("/"):
+ if not training_config.startswith(UPSTREAM_NEMO_RL_ROOT):
+ training_config = get_mounted_path(cluster_config, training_config)
+ elif config is not None:
+ training_config = f"/nemo_run/code/{training_config}"As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/nemo_rl/sft.py` around lines 394 - 396, The code
currently only remaps absolute config paths outside UPSTREAM_NEMO_RL_ROOT,
allowing repo-relative --config values (training_config) to be passed through
and potentially fail at runtime; update the logic in the training_config
handling to either (a) normalize repo-relative paths by resolving them into the
mounted repo location (/nemo_run/code/<relative>) using
get_mounted_path(cluster_config, ...) when training_config is not absolute and
does not equal UPSTREAM_SFT_CONFIG, or (b) explicitly reject non-upstream
relative inputs with a clear error message (raise/exit) so callers of --config
get immediate feedback; modify the block referencing training_config,
UPSTREAM_SFT_CONFIG, UPSTREAM_NEMO_RL_ROOT and get_mounted_path to implement one
of these behaviors and ensure cluster_config is used to derive the mounted repo
path.
|
I updated the code to support local ray template for sandbox |
| # NOTICES.txt file points to where the OSS source code is archived | ||
| RUN echo "This distribution includes open source which is archived at the following URL: https://opensource.nvidia.com/oss/teams/nvidia/nemo-rl/${RC_DATE}:linux-${TARGETARCH}/index.html" > NOTICES.txt && \ | ||
| echo "For further inquiries or assistance, contact us at oss-requests@nvidia.com" >> NOTICES.txt | ||
| WORKDIR /opt/NeMo-Skills |
There was a problem hiding this comment.
do we need nemo skills to be installed? Ideally we just use the corresponding nightly in cluster configs and not even have this dockerfile
There was a problem hiding this comment.
I think we do not need it, I put it in case later we might need nemo-skills for this.
I think we can remove the dockerfile completely now
| sandbox_container: str = typer.Option(None, help="Override container image for the sandbox sidecar"), | ||
| ray_template: str = typer.Option( | ||
| None, | ||
| help="Override the Ray template for training tasks. Can be a template name or a local .j2 path. Only applies to Ray-backed jobs; local executor ignores it.", |
There was a problem hiding this comment.
| help="Override the Ray template for training tasks. Can be a template name or a local .j2 path. Only applies to Ray-backed jobs; local executor ignores it.", | |
| help="Override the Ray template for training tasks. Can be a template name or a local .j2 path. Only applies to slurm-based jobs; local executor ignores it.", |
| validation_data = get_mounted_path(cluster_config, validation_data) | ||
| if run_conversion_only: | ||
| dependent_jobs = -1 | ||
| if dependent_jobs > 0: |
There was a problem hiding this comment.
| if dependent_jobs > 0: | |
| if dependent_jobs >= 0: |
| @@ -2,118 +2,123 @@ | |||
| sft: | |||
There was a problem hiding this comment.
is this not available in nemo-rl directly?
There was a problem hiding this comment.
The nano-v3 yaml has not merged, and I can put it for reference, I can remove it now or after nemo-rl pr merged.
| het_group = 0 | ||
| het_group_indices = [] | ||
| total_het_groups = (n_servers if server_config is not None else 0) + bool(cmd) + with_sandbox | ||
| het_group_num_nodes = {} |
There was a problem hiding this comment.
not fully sure i understand this change, @gwarmstrong could you also have a look?
|
@Kipok
|
There was a problem hiding this comment.
♻️ Duplicate comments (1)
nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2 (1)
289-293:⚠️ Potential issue | 🟠 Major
get_container_pid()may return wrong container PID when sandbox is running.This function returns the first container PID from
enroot list, but when the sandbox container starts (lines 316-341), multiple containers exist on the same node. The function may return the sandbox container's PID instead ofray-head, causingray statuschecks and job execution to fail or behave non-deterministically.Consider filtering by container name pattern. For example, enroot containers often include the image name in their identifier.
#!/bin/bash # Check how enroot list -f output looks and if container names are distinguishable # This helps understand if filtering by name is feasible # Look for enroot-related documentation or usage in the codebase rg -n "enroot list" --type-add 'jinja:*.j2' --type jinja rg -n "container-name" nemo_skills/pipeline/nemo_rl/ray_templates/🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2` around lines 289 - 293, get_container_pid() currently returns the first PID from enroot list which can pick the sandbox container; change it to select the PID for the ray-head container by filtering the enroot list output by the container name or image pattern (look for "ray-head" or the Ray image tag used when launching the head) before extracting the PID, i.e. modify the logic in get_container_pid to match container identifiers by name/pattern rather than taking the first numeric PID so subsequent ray status checks and job runs target the correct container.
🧹 Nitpick comments (3)
nemo_skills/pipeline/utils/exp.py (1)
611-611: Variable shadowing:client_num_gpusis redefined.
client_num_gpusis first defined at line 537, then redefined at line 611 inside the loop. While the logic appears correct (the inner one uses different conditions), this shadowing can cause confusion during maintenance. Consider using a different name likemain_task_gpusfor the inner variable.♻️ Rename to avoid shadowing
- client_num_gpus = num_gpus if (server_config is None or num_nodes > 1) else 0 + main_task_gpus = num_gpus if (server_config is None or num_nodes > 1) else 0 main_env_updates = {"NEMO_SKILLS_SANDBOX_PORT": sandbox_port} if with_ray: - main_env_updates["GPUS_PER_NODE"] = str(client_num_gpus) + main_env_updates["GPUS_PER_NODE"] = str(main_task_gpus)And update the executor call at line 632:
- gpus_per_node=client_num_gpus, + gpus_per_node=main_task_gpus,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/exp.py` at line 611, The inner redefinition of client_num_gpus inside the loop shadows the earlier variable; rename the inner variable (e.g., to main_task_gpus or task_client_gpus) where it is assigned with the expression "num_gpus if (server_config is None or num_nodes > 1) else 0" and update any subsequent uses (notably the executor call that currently references client_num_gpus) to use the new name so the outer client_num_gpus remains unchanged and the intent is clear.nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2 (1)
411-413: Dead code: condition is never true.Same issue as the other template. The loop starts at
i = 1, so$i -eq 0is never satisfied.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2` around lines 411 - 413, The conditional checking if [[ $i -eq 0 ]] inside the loop is dead because the loop starts at i=1; either remove the unreachable branch or change the check to the correct index (e.g., [[ $i -eq 1 ]]) depending on intended behavior; update the template snippet that sets OVERLAP_HEAD_AND_WORKER_ARG (the block containing OVERLAP_HEAD_AND_WORKER_ARG="--overlap") so it executes for the actual first-iteration index or is removed if not needed.nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2 (1)
403-405: Dead code: condition is never true.The loop starts at
i = 1, so the condition$i -eq 0can never be true. TheOVERLAP_HEAD_AND_WORKER_ARGvariable is never set. If workers don't need--overlap, this block can be removed. If overlap was intended for worker 0 (which doesn't exist in this loop since the head is on node 0), the logic needs rethinking.🧹 Remove dead code
srun {{ common_srun_args }} ${OVERLAP_HEAD_AND_WORKER_ARG:-} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$CPUS_PER_WORKER -w "$node_i" -o $LOG_DIR/{{ ray_log_prefix }}worker-$i.log bash -x -c "$worker_cmd" & + srun {{ common_srun_args }} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$CPUS_PER_WORKER -w "$node_i" -o $LOG_DIR/{{ ray_log_prefix }}worker-$i.log bash -x -c "$worker_cmd" & - if [[ $i -eq 0 ]]; then - OVERLAP_HEAD_AND_WORKER_ARG="--overlap" - fi - srun {{ common_srun_args }} ${OVERLAP_HEAD_AND_WORKER_ARG:-} --container-name=ray-worker-$i --exact --nodes=1 --ntasks=1 --cpus-per-task=$CPUS_PER_WORKER -w "$node_i" -o $LOG_DIR/{{ ray_log_prefix }}worker-$i.log bash -x -c "$worker_cmd" &🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2` around lines 403 - 405, The conditional block that checks "if [[ $i -eq 0 ]]" inside the worker loop is dead because the loop starts at i=1, so OVERLAP_HEAD_AND_WORKER_ARG is never set; remove the entire if-block (the lines setting OVERLAP_HEAD_AND_WORKER_ARG and its fi) from nemo_skills_sandbox_ray.sub.j2, or if the intent was to enable overlap for the head, move/convert that logic to the head-creation branch (where node 0/head is created) instead of inside the worker loop; reference symbols: the loop variable i and OVERLAP_HEAD_AND_WORKER_ARG.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`:
- Around line 289-293: get_container_pid() currently returns the first PID from
enroot list which can pick the sandbox container; change it to select the PID
for the ray-head container by filtering the enroot list output by the container
name or image pattern (look for "ray-head" or the Ray image tag used when
launching the head) before extracting the PID, i.e. modify the logic in
get_container_pid to match container identifiers by name/pattern rather than
taking the first numeric PID so subsequent ray status checks and job runs target
the correct container.
---
Nitpick comments:
In
`@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2`:
- Around line 411-413: The conditional checking if [[ $i -eq 0 ]] inside the
loop is dead because the loop starts at i=1; either remove the unreachable
branch or change the check to the correct index (e.g., [[ $i -eq 1 ]]) depending
on intended behavior; update the template snippet that sets
OVERLAP_HEAD_AND_WORKER_ARG (the block containing
OVERLAP_HEAD_AND_WORKER_ARG="--overlap") so it executes for the actual
first-iteration index or is removed if not needed.
In `@nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2`:
- Around line 403-405: The conditional block that checks "if [[ $i -eq 0 ]]"
inside the worker loop is dead because the loop starts at i=1, so
OVERLAP_HEAD_AND_WORKER_ARG is never set; remove the entire if-block (the lines
setting OVERLAP_HEAD_AND_WORKER_ARG and its fi) from
nemo_skills_sandbox_ray.sub.j2, or if the intent was to enable overlap for the
head, move/convert that logic to the head-creation branch (where node 0/head is
created) instead of inside the worker loop; reference symbols: the loop variable
i and OVERLAP_HEAD_AND_WORKER_ARG.
In `@nemo_skills/pipeline/utils/exp.py`:
- Line 611: The inner redefinition of client_num_gpus inside the loop shadows
the earlier variable; rename the inner variable (e.g., to main_task_gpus or
task_client_gpus) where it is assigned with the expression "num_gpus if
(server_config is None or num_nodes > 1) else 0" and update any subsequent uses
(notably the executor call that currently references client_num_gpus) to use the
new name so the outer client_num_gpus remains unchanged and the intent is clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 5b927823-65af-48cc-989f-8cdd66446574
📒 Files selected for processing (9)
cluster_configs/example-local.yamldockerfiles/Dockerfile.nemo-rlnemo_skills/__init__.pynemo_skills/pipeline/nemo_rl/grpo.pynemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray.sub.j2nemo_skills/pipeline/nemo_rl/ray_templates/nemo_skills_sandbox_ray_enroot.sub.j2nemo_skills/pipeline/utils/exp.pynemo_skills/training/nemo_rl/configs/sft.yamltests/gpu-tests/test-local.yaml
💤 Files with no reviewable changes (2)
- dockerfiles/Dockerfile.nemo-rl
- nemo_skills/training/nemo_rl/configs/sft.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/pipeline/nemo_rl/grpo.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 611-623: main_env_updates currently sets "SANDBOX_PORT" to
sandbox_port which may be an integer; make it consistently a string like
"GPUS_PER_NODE" by converting sandbox_port to str when updating main_env_updates
(update the block that builds main_env_updates before calling
temporary_env_update; reference variables/symbols main_env_updates,
SANDBOX_PORT, sandbox_port, temporary_env_update, and cluster_config to locate
the change).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a959d48a-d304-44e6-9d54-3fa7d8ba200b
📒 Files selected for processing (1)
nemo_skills/pipeline/utils/exp.py
Kipok
left a comment
There was a problem hiding this comment.
lgtm, as long as tests pass. Did we launch slurm test (the omr one) here?
| help="If specified, will reuse the code from this experiment. " | ||
| "Can provide an experiment name or an experiment object if running from code.", | ||
| ), | ||
| config: str = typer.Option(None, help="Override training config YAML; defaults to the upstream container config"), |
There was a problem hiding this comment.
| config: str = typer.Option(None, help="Override training config YAML; defaults to the upstream container config"), | |
| config: str = typer.Option(None, help="Override training config YAML; defaults to the upstream container config for nano-v3"), |
| nemo-skills: dockerfile:dockerfiles/Dockerfile.nemo-skills | ||
| verl: dockerfile:dockerfiles/Dockerfile.verl | ||
| nemo-rl: dockerfile:dockerfiles/Dockerfile.nemo-rl | ||
| nemo-rl: nvcr.io/nvidian/nemo-rl:733c7d6-48669320 |
There was a problem hiding this comment.
is this internal container? We ideally want something publicly available here
There was a problem hiding this comment.
It is for nightly container published internally, but not sure if public ones are up to date, I will have a check
There was a problem hiding this comment.
@Kipok I contacted the team and was told public container is released for every 2 months roughly, it seems too old then?
There was a problem hiding this comment.
I guess it's ok to keep it as private as it has a commit in the name. Can you update the readme in dockerfiles folder to mention that it defaults to private container accessible to nvidia employees, but for anyone else they can build it from corresponding dockerfile / commit in nemo-rl repo? But one thing we should make sure is that our ci still works, so can you check what key / setting we need to set there and sync with me offline so that I can update our repo settings and we can test that gpu / gitlab ci passes
| @@ -0,0 +1,502 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
why do we need two templates? If one was for lax cluster, we can probably not port it as it's not needed anymore
There was a problem hiding this comment.
It is for lax, I keep it here in case one day we need for lax types. But I can delete it now
not yet, I can test for this. |
894cee3 to
1a725b3
Compare
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
1a725b3 to
0ad7b49
Compare
sorry for late reply here, it seems omr cannot pass. I checked the sft loss curve which looks good to me, but eval results is bad, I will investigate this. |
Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Documentation