feat(onboard): opt-in GPU sandbox mode via NEMOCLAW_GPU_MODE #2652
tmckayus wants to merge 1 commit into
Conversation
📝 Walkthrough
Adds GPU-enabled sandbox mode driven by NEMOCLAW_GPU_MODE, introduces GPU-specific base image/tag constants and helpers, updates base-image digest resolution and Dockerfile ARG patching for CPU/GPU, enforces preflight GPU cache checks, and conditions gateway/sandbox startup to forward the --gpu flag.
Sequence Diagram(s)
sequenceDiagram
participant Dev as Developer/CLI
participant GW as Gateway (startup)
participant Onboard as onboard.ts
participant Cache as Local Cache / Docker
participant Registry as GHCR/Registry
participant Plugin as NVIDIA Device Plugin
Dev->>GW: start sandbox with NEMOCLAW_GPU_MODE
GW->>Onboard: initialize sandbox (gpuMode inferred)
Onboard->>Plugin: detect GPU presence
alt GPU present
Onboard->>Cache: check for GPU base image digest (preflight)
alt cached
Onboard->>Onboard: resolve/pin GPU base image digest
else not cached
Onboard->>Registry: attempt GHCR digest lookup
alt resolved
Onboard->>Onboard: pin GPU image digest
else failed
Onboard-->>Dev: throw with build instructions (fail-fast)
end
end
Onboard->>GW: start gateway with `--gpu`
GW->>Plugin: device plugin allocates GPU
else no GPU
Onboard-->>Dev: fail (GPU mode requested but no GPU)
end
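To make the diagram's fail-fast branch concrete, here is a minimal sketch of the preflight digest resolution, assuming plain Docker CLI checks; the function name, the error wording, and the control flow are illustrative, not the PR's actual identifiers in src/lib/onboard.ts.

```ts
import { spawnSync } from "node:child_process";

// Illustrative preflight: local cache first, then a registry lookup,
// then fail fast with build instructions (all names here are assumptions).
function resolveGpuBaseDigest(image: string): string {
  // Local cache: RepoDigests is only populated for pulled/pushed images,
  // so a missing or purely local image makes this template fail (non-zero exit).
  const local = spawnSync("docker", [
    "image", "inspect", "--format", "{{index .RepoDigests 0}}", image,
  ]);
  if (local.status === 0) return local.stdout.toString().trim();

  // Registry fallback: `docker manifest inspect` queries GHCR without pulling.
  const remote = spawnSync("docker", ["manifest", "inspect", image]);
  if (remote.status === 0) return image; // digest pinning elided in this sketch

  // Fail fast rather than starting a sandbox that cannot resolve its base image.
  throw new Error(
    `GPU base image ${image} not found locally or on the registry; ` +
      `build it from Dockerfile.base.gpu and tag it as ${image}.`,
  );
}
```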
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
There may be a better approach than using an env var, but this was non-invasive to the CLI and good for a POC. Likewise for the PR -- there may be a better way, but this solves the essential issue for us: getting cuOpt into the hands of the agent as a local tool, without the complexity and overhead of cuOpt as an external endpoint. As written, the PR expects a locally built GPU base image, named and tagged appropriately (see below).
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2684-2699: The recovery path doesn't include the GPU flag so
recovered gateways lose GPU passthrough; update recoverGatewayRuntime (the
function that restarts the gateway) to append "--gpu" to the gwArgs it builds
when isGpuSandboxMode(_gpu) is true (same logic used when constructing gwArgs
for startGatewayWithOptions), ensuring any place that constructs gateway args
for recovery (also the similar block around the 2950-2957 region) uses the
gwArgs variable and checks isGpuSandboxMode(_gpu) before pushing "--gpu".
- Around line 2572-2576: The current check uses runCapture(..., { ignoreError:
true }) which returns stderr text on failure, causing variables like cached and
localCheck to be truthy even when the image is missing; change the logic to test
the command's exit status instead of its captured output: call the docker
inspection using a variant that returns/throws on non-zero exit (e.g., run(...)
or runCapture but inspect its exitCode) and treat any non-zero exit as "image
not present", then set cached/localCheck to true only when the command succeeds
(exit code === 0); update both occurrences that use runCapture for "docker image
inspect" so the GPU fast-fail and local-cache fallback behave correctly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d8d34d89-2922-4635-af6d-b31f65476a4e
📒 Files selected for processing (2)
- src/lib/onboard.ts
- test/onboard.test.ts
```diff
   const gwArgs = ["--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)];
-  // Do NOT pass --gpu here. On DGX Spark (and most GPU hosts), inference is
-  // routed through a host-side provider (Ollama, vLLM, or cloud API) — the
-  // sandbox itself does not need direct GPU access. Passing --gpu causes
+  // Default: do NOT pass --gpu. On DGX Spark (and most GPU hosts), inference
+  // is routed through a host-side provider (Ollama, vLLM, or cloud API) —
+  // the sandbox itself does not need direct GPU access. Passing --gpu causes
   // FailedPrecondition errors when the gateway's k3s device plugin cannot
   // allocate GPUs. See: https://build.nvidia.com/spark/nemoclaw/instructions
+  //
+  // Exception: NEMOCLAW_GPU_MODE=1 opts the sandbox into GPU passthrough
+  // for CUDA workloads that need direct device access. This adds --gpu
+  // to the gateway so the k3s NVIDIA device plugin can satisfy
+  // `resources.limits.nvidia.com/gpu: 1`. Requires: CDI spec on the host
+  // (`nvidia-ctk cdi generate`), NVIDIA Container Toolkit configured as
+  // default runtime, and a GPU-capable host.
+  if (isGpuSandboxMode(_gpu)) {
+    gwArgs.push("--gpu");
+  }
```
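isGpuSandboxMode() itself is not shown in this excerpt; a minimal sketch of what the call sites imply, assuming the env-var gate described in the PR plus a GPU-detection result with a boolean presence field (the parameter shape is an assumption, as is combining the two checks in one place):

```ts
// Hypothetical shape; the real detection type in the codebase may differ.
interface GpuInfo {
  present: boolean;
}

// Opt-in gate: GPU sandbox mode requires both the env var and a detected GPU.
function isGpuSandboxMode(gpu: GpuInfo | null): boolean {
  return process.env.NEMOCLAW_GPU_MODE === "1" && gpu?.present === true;
}
```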
Carry the new --gpu flag into recovery gateway starts as well.
This change wires --gpu into startGatewayWithOptions(), but recoverGatewayRuntime() still restarts the gateway without it. In GPU mode, a recovered gateway will come back without GPU allocation support, so later GPU sandbox operations can regress after a restart.
Suggested fix
```diff
-  const startResult = runOpenshell(
-    ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)],
+  const recoveryArgs = ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)];
+  if (isGpuSandboxMode(nim.detectGpu())) {
+    recoveryArgs.push("--gpu");
+  }
+  const startResult = runOpenshell(
+    recoveryArgs,
     {
       ignoreError: true,
       env: getGatewayStartEnv(),
```
Also applies to: 2950-2957
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard.ts` around lines 2684 - 2699, The recovery path doesn't
include the GPU flag so recovered gateways lose GPU passthrough; update
recoverGatewayRuntime (the function that restarts the gateway) to append "--gpu"
to the gwArgs it builds when isGpuSandboxMode(_gpu) is true (same logic used
when constructing gwArgs for startGatewayWithOptions), ensuring any place that
constructs gateway args for recovery (also the similar block around the
2950-2957 region) uses the gwArgs variable and checks isGpuSandboxMode(_gpu)
before pushing "--gpu".
This is the Dockerfile.base.gpu file that I used to construct a generic GPU base image, plus a short doc on how to use it. The command I used to build the GPU base image is DOCKER_BUILDKIT=1 docker build --pull …
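The command above is cut off in the thread; a hypothetical complete invocation, using the Dockerfile name from this comment and the image name the PR summary expects (the `:latest` tag and argument layout are guesses, not the author's actual command):

```sh
# Hypothetical build: file name from this comment, image name from the PR
# summary (ghcr.io/nvidia/nemoclaw/sandbox-base-gpu); the tag is a guess.
DOCKER_BUILDKIT=1 docker build --pull \
  -f Dockerfile.base.gpu \
  -t ghcr.io/nvidia/nemoclaw/sandbox-base-gpu:latest .
```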
✨ Thanks for submitting this PR that adds a generic GPU variant of the sandbox, allowing agents to run CUDA workloads on an NVIDIA GPU from inside the sandbox. This change introduces an opt-in environment variable, NEMOCLAW_GPU_MODE, which enables the GPU sandbox mode.
Adds a generic GPU variant of the sandbox so agents can run CUDA workloads on an NVIDIA GPU from inside the sandbox. Gated on NEMOCLAW_GPU_MODE=1 so the default CPU path is unchanged. Note: this commit does not include a GPU base Docker file. Signed-off-by: Trevor McKay <tmgithub1@gmail.com>
27d5d12 to b46b686
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
3015-3029: ⚠️ Potential issue | 🟠 Major
Recovery restart still drops GPU mode after gateway recovery.
startGatewayWithOptions() correctly appends --gpu (Lines 3027-3029), but the recovery restart still uses fixed args without it (Line 3281). In GPU mode, a recovered gateway can come back without GPU allocation support.
Suggested fix
```diff
 async function recoverGatewayRuntime() {
   runOpenshell(["gateway", "select", GATEWAY_NAME], { ignoreError: true });
   let status = runCaptureOpenshell(["status"], { ignoreError: true });
   if (status.includes("Connected") && isSelectedGateway(status)) {
     process.env.OPENSHELL_GATEWAY = GATEWAY_NAME;
     return true;
   }
+  const gpu = nim.detectGpu();
+  const recoveryArgs = ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)];
+  if (isGpuSandboxMode(gpu)) {
+    recoveryArgs.push("--gpu");
+  }
   const startResult = runOpenshell(
-    ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)],
+    recoveryArgs,
     {
       ignoreError: true,
       env: getGatewayStartEnv(),
       suppressOutput: true,
     },
   );
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 3015 - 3029, The recovery restart path builds gateway args without honoring GPU mode causing recovered gateways to drop GPU passthrough; update the recovery restart logic to mirror startGatewayWithOptions by checking isGpuSandboxMode(_gpu) and appending "--gpu" to the gwArgs used during recovery restart (the same symbol gwArgs and isGpuSandboxMode should be used), so the reboot/recovery routine constructs the same arg list as startGatewayWithOptions and preserves GPU passthrough on restart.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3015-3029: The recovery restart path builds gateway args without
honoring GPU mode causing recovered gateways to drop GPU passthrough; update the
recovery restart logic to mirror startGatewayWithOptions by checking
isGpuSandboxMode(_gpu) and appending "--gpu" to the gwArgs used during recovery
restart (the same symbol gwArgs and isGpuSandboxMode should be used), so the
reboot/recovery routine constructs the same arg list as startGatewayWithOptions
and preserves GPU passthrough on restart.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 28439c5b-2950-49cb-82b2-2ef68c9d0b31
📒 Files selected for processing (2)
- src/lib/onboard.ts
- test/onboard.test.ts
✅ Files skipped from review due to trivial changes (1)
- test/onboard.test.ts
Thanks for the groundwork here, @tmckay — this validated that GPU passthrough through OpenShell works end-to-end with CUDA workloads (cuOpt). #2799 now auto-enables GPU passthrough when an NVIDIA GPU is detected on the host — no env var needed. That should cover the core use case. Does the …
Hi @prekshivyas, sorry for the delay -- other issues :) Yes, the base image I used included CUDA libraries plus cuOpt. I saw the updated nemoclaw with --gpu flag support and gave it a try. I was able to run nvidia-smi as promised -- awesome! When I tried to install and use cuOpt in the sandbox, however, I ran into some issues. It could be my GPU -- I'm running on a brand-new RTX Blackwell, so something a bit older might yield better results. I'll try again and see where I get to. Thanks!
Closing as superseded by #2799, which merged the supported --gpu passthrough. The remaining gap you called out around CUDA/cuOpt libraries in the sandbox image sounds like a narrower custom GPU base-image/package follow-up. Please open a fresh issue or PR for that piece so it can be reviewed separately from the now-landed passthrough support.
Summary
Adds a generic GPU variant of the sandbox so agents can run CUDA workloads on an NVIDIA GPU from inside the sandbox. Gated on NEMOCLAW_GPU_MODE=1 so the default CPU path is unchanged. Note: this commit does not include a GPU base Docker file.
The goal is to allow GPU-enabled tools to be used by the agent in the sandbox, without having to stage the tool itself as an external service that the agent connects to. Running a tool service on the local host outside of the sandbox adds unnecessary complexity and overhead.
Changes
- `NEMOCLAW_GPU_MODE=1` opt-in env var that wires `--gpu` through `openshell gateway start` and `openshell sandbox create`, and selects a GPU sandbox base image (`ghcr.io/nvidia/nemoclaw/sandbox-base-gpu`); see the usage sketch after this list.
- GPU device paths (`/dev/nvidia*`) and `/proc` entries are appended at sandbox-create time by OpenShell's `enrich_proto_baseline_paths()`, so no GPU-specific policy file is needed.
- Expects a locally built GPU base image for now (`Dockerfile.base.gpu`, attached); TODO in code to relax the cache check once it gets published to GHCR.
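As a usage sketch of the first item: only `NEMOCLAW_GPU_MODE=1` and the `--gpu` flag come from this PR; the gateway name and port values below are illustrative, and the commands show the manual equivalent of what the onboarding code wires up.

```sh
# Opt in to GPU sandbox mode for the CLI session (env var from this PR).
export NEMOCLAW_GPU_MODE=1

# What onboarding then wires up underneath (name/port values are
# illustrative): both the gateway and the sandbox receive --gpu.
openshell gateway start --name my-gateway --port 8080 --gpu
openshell sandbox create --gpu
```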
Verification
- `npx prek run --all-files` passes
- `npm test` passes
- `make docs` builds without warnings (doc changes only)
Additionally tested this with cuOpt and successfully solved LP problems in the sandbox.
AI Disclosure
Signed-off-by: Trevor McKay tmgithub1@gmail.com