feat(onboard): opt-in GPU sandbox mode via NEMOCLAW_GPU_MODE #2652
tmckayus wants to merge 1 commit into
Conversation
📝 Walkthrough
Adds GPU-enabled sandbox mode driven by NEMOCLAW_GPU_MODE, introduces GPU-specific base image/tag constants and helpers, updates base-image digest resolution and Dockerfile ARG patching for CPU/GPU, enforces preflight GPU cache checks, and conditions gateway/sandbox startup to forward the --gpu flag.
Sequence Diagram(s)
sequenceDiagram
participant Dev as Developer/CLI
participant GW as Gateway (startup)
participant Onboard as onboard.ts
participant Cache as Local Cache / Docker
participant Registry as GHCR/Registry
participant Plugin as NVIDIA Device Plugin
Dev->>GW: start sandbox with NEMOCLAW_GPU_MODE
GW->>Onboard: initialize sandbox (gpuMode inferred)
Onboard->>Plugin: detect GPU presence
alt GPU present
Onboard->>Cache: check for GPU base image digest (preflight)
alt cached
Onboard->>Onboard: resolve/pin GPU base image digest
else not cached
Onboard->>Registry: attempt GHCR digest lookup
alt resolved
Onboard->>Onboard: pin GPU image digest
else failed
Onboard-->>Dev: throw with build instructions (fail-fast)
end
end
Onboard->>GW: start gateway with `--gpu`
GW->>Plugin: device plugin allocates GPU
else no GPU
Onboard-->>Dev: fail (GPU mode requested but no GPU)
end
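To make the diagram's fail-fast branch concrete, here is a minimal sketch of the preflight digest resolution, assuming plain Docker CLI checks; the function name, the error wording, and the control flow are illustrative, not the PR's actual identifiers in src/lib/onboard.ts.

```ts
import { spawnSync } from "node:child_process";

// Illustrative preflight: local cache first, then a registry lookup,
// then fail fast with build instructions (all names here are assumptions).
function resolveGpuBaseDigest(image: string): string {
  // Local cache: RepoDigests is only populated for pulled/pushed images,
  // so a missing or purely local image makes this template fail (non-zero exit).
  const local = spawnSync("docker", [
    "image", "inspect", "--format", "{{index .RepoDigests 0}}", image,
  ]);
  if (local.status === 0) return local.stdout.toString().trim();

  // Registry fallback: `docker manifest inspect` queries GHCR without pulling.
  const remote = spawnSync("docker", ["manifest", "inspect", image]);
  if (remote.status === 0) return image; // digest pinning elided in this sketch

  // Fail fast rather than starting a sandbox that cannot resolve its base image.
  throw new Error(
    `GPU base image ${image} not found locally or on the registry; ` +
      `build it from Dockerfile.base.gpu and tag it as ${image}.`,
  );
}
```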
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
There may be a better approach than using an env var, but this was non-invasive to the CLI and good for a POC. Likewise for the PR -- there may be a better way, but this solves the essential issue for us: getting cuOpt into the hands of the agent as a local tool, without the complexity and overhead of cuOpt as an external endpoint. As written, the PR expects a locally built GPU base image, named and tagged appropriately (see below).
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2684-2699: The recovery path doesn't include the GPU flag so
recovered gateways lose GPU passthrough; update recoverGatewayRuntime (the
function that restarts the gateway) to append "--gpu" to the gwArgs it builds
when isGpuSandboxMode(_gpu) is true (same logic used when constructing gwArgs
for startGatewayWithOptions), ensuring any place that constructs gateway args
for recovery (also the similar block around the 2950-2957 region) uses the
gwArgs variable and checks isGpuSandboxMode(_gpu) before pushing "--gpu".
- Around line 2572-2576: The current check uses runCapture(..., { ignoreError:
true }) which returns stderr text on failure, causing variables like cached and
localCheck to be truthy even when the image is missing; change the logic to test
the command's exit status instead of its captured output: call the docker
inspection using a variant that returns/throws on non-zero exit (e.g., run(...)
or runCapture but inspect its exitCode) and treat any non-zero exit as "image
not present", then set cached/localCheck to true only when the command succeeds
(exit code === 0); update both occurrences that use runCapture for "docker image
inspect" so the GPU fast-fail and local-cache fallback behave correctly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d8d34d89-2922-4635-af6d-b31f65476a4e
📒 Files selected for processing (2)
- src/lib/onboard.ts
- test/onboard.test.ts
```diff
   const gwArgs = ["--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)];
-  // Do NOT pass --gpu here. On DGX Spark (and most GPU hosts), inference is
-  // routed through a host-side provider (Ollama, vLLM, or cloud API) — the
-  // sandbox itself does not need direct GPU access. Passing --gpu causes
+  // Default: do NOT pass --gpu. On DGX Spark (and most GPU hosts), inference
+  // is routed through a host-side provider (Ollama, vLLM, or cloud API) —
+  // the sandbox itself does not need direct GPU access. Passing --gpu causes
   // FailedPrecondition errors when the gateway's k3s device plugin cannot
   // allocate GPUs. See: https://build.nvidia.com/spark/nemoclaw/instructions
+  //
+  // Exception: NEMOCLAW_GPU_MODE=1 opts the sandbox into GPU passthrough
+  // for CUDA workloads that need direct device access. This adds --gpu
+  // to the gateway so the k3s NVIDIA device plugin can satisfy
+  // `resources.limits.nvidia.com/gpu: 1`. Requires: CDI spec on the host
+  // (`nvidia-ctk cdi generate`), NVIDIA Container Toolkit configured as
+  // default runtime, and a GPU-capable host.
+  if (isGpuSandboxMode(_gpu)) {
+    gwArgs.push("--gpu");
+  }
```
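isGpuSandboxMode() itself is not shown in this excerpt; a minimal sketch of what the call sites imply, assuming the env-var gate described in the PR plus a GPU-detection result with a boolean presence field (the parameter shape is an assumption, as is combining the two checks in one place):

```ts
// Hypothetical shape; the real detection type in the codebase may differ.
interface GpuInfo {
  present: boolean;
}

// Opt-in gate: GPU sandbox mode requires both the env var and a detected GPU.
function isGpuSandboxMode(gpu: GpuInfo | null): boolean {
  return process.env.NEMOCLAW_GPU_MODE === "1" && gpu?.present === true;
}
```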
Carry the new --gpu flag into recovery gateway starts as well.
This change wires --gpu into startGatewayWithOptions(), but recoverGatewayRuntime() still restarts the gateway without it. In GPU mode, a recovered gateway will come back without GPU allocation support, so later GPU sandbox operations can regress after a restart.
Suggested fix
```diff
-  const startResult = runOpenshell(
-    ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)],
+  const recoveryArgs = ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)];
+  if (isGpuSandboxMode(nim.detectGpu())) {
+    recoveryArgs.push("--gpu");
+  }
+  const startResult = runOpenshell(
+    recoveryArgs,
     {
       ignoreError: true,
       env: getGatewayStartEnv(),
```
Also applies to: 2950-2957
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard.ts` around lines 2684 - 2699, The recovery path doesn't
include the GPU flag so recovered gateways lose GPU passthrough; update
recoverGatewayRuntime (the function that restarts the gateway) to append "--gpu"
to the gwArgs it builds when isGpuSandboxMode(_gpu) is true (same logic used
when constructing gwArgs for startGatewayWithOptions), ensuring any place that
constructs gateway args for recovery (also the similar block around the
2950-2957 region) uses the gwArgs variable and checks isGpuSandboxMode(_gpu)
before pushing "--gpu".
This is the Dockerfile.base.gpu file that I used to construct a generic GPU base image, plus a short doc on how to use it. The command I used to build the GPU base image is DOCKER_BUILDKIT=1 docker build --pull …
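The command above is cut off in the thread; a hypothetical complete invocation, using the Dockerfile name from this comment and the image name the PR summary expects (the `:latest` tag and argument layout are guesses, not the author's actual command):

```sh
# Hypothetical build: file name from this comment, image name from the PR
# summary (ghcr.io/nvidia/nemoclaw/sandbox-base-gpu); the tag is a guess.
DOCKER_BUILDKIT=1 docker build --pull \
  -f Dockerfile.base.gpu \
  -t ghcr.io/nvidia/nemoclaw/sandbox-base-gpu:latest .
```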
✨ Thanks for submitting this PR that adds a generic GPU variant of the sandbox, allowing agents to run CUDA workloads on an NVIDIA GPU from inside the sandbox. This change introduces an opt-in environment variable, NEMOCLAW_GPU_MODE, which enables the GPU sandbox mode.
Adds a generic GPU variant of the sandbox so agents can run CUDA workloads on an NVIDIA GPU from inside the sandbox. Gated on NEMOCLAW_GPU_MODE=1 so the default CPU path is unchanged. Note: this commit does not include a GPU base Docker file. Signed-off-by: Trevor McKay <tmgithub1@gmail.com>
27d5d12 to b46b686
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
3015-3029: ⚠️ Potential issue | 🟠 Major
Recovery restart still drops GPU mode after gateway recovery.
startGatewayWithOptions() correctly appends --gpu (Lines 3027-3029), but the recovery restart still uses fixed args without it (Line 3281). In GPU mode, a recovered gateway can come back without GPU allocation support.
Suggested fix
```diff
 async function recoverGatewayRuntime() {
   runOpenshell(["gateway", "select", GATEWAY_NAME], { ignoreError: true });
   let status = runCaptureOpenshell(["status"], { ignoreError: true });
   if (status.includes("Connected") && isSelectedGateway(status)) {
     process.env.OPENSHELL_GATEWAY = GATEWAY_NAME;
     return true;
   }
+  const gpu = nim.detectGpu();
+  const recoveryArgs = ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)];
+  if (isGpuSandboxMode(gpu)) {
+    recoveryArgs.push("--gpu");
+  }
   const startResult = runOpenshell(
-    ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)],
+    recoveryArgs,
     {
       ignoreError: true,
       env: getGatewayStartEnv(),
       suppressOutput: true,
     },
   );
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 3015 - 3029, The recovery restart path builds gateway args without honoring GPU mode causing recovered gateways to drop GPU passthrough; update the recovery restart logic to mirror startGatewayWithOptions by checking isGpuSandboxMode(_gpu) and appending "--gpu" to the gwArgs used during recovery restart (the same symbol gwArgs and isGpuSandboxMode should be used), so the reboot/recovery routine constructs the same arg list as startGatewayWithOptions and preserves GPU passthrough on restart.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3015-3029: The recovery restart path builds gateway args without
honoring GPU mode causing recovered gateways to drop GPU passthrough; update the
recovery restart logic to mirror startGatewayWithOptions by checking
isGpuSandboxMode(_gpu) and appending "--gpu" to the gwArgs used during recovery
restart (the same symbol gwArgs and isGpuSandboxMode should be used), so the
reboot/recovery routine constructs the same arg list as startGatewayWithOptions
and preserves GPU passthrough on restart.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 28439c5b-2950-49cb-82b2-2ef68c9d0b31
📒 Files selected for processing (2)
- src/lib/onboard.ts
- test/onboard.test.ts
✅ Files skipped from review due to trivial changes (1)
- test/onboard.test.ts
Thanks for the groundwork here, @tmckay — this validated that GPU passthrough through OpenShell works end-to-end with CUDA workloads (cuOpt). #2799 now auto-enables GPU passthrough when an NVIDIA GPU is detected on the host — no env var needed. That should cover the core use case. Does the …
Hi @prekshivyas, sorry for the delay -- other issues :) Yes, the base image I used included CUDA libraries plus cuOpt. I saw the updated nemoclaw with --gpu flag support and gave it a try. I was able to run nvidia-smi as promised -- awesome! When I tried to install and use cuOpt in the sandbox, however, I ran into some issues. It could be my GPU -- I'm running on a brand-new RTX Blackwell, so something a bit older might yield better results. I'll try again and see where I get to. Thanks!
Closing as superseded by #2799, which merged the supported --gpu passthrough. The remaining gap you called out around CUDA/cuOpt libraries in the sandbox image sounds like a narrower custom GPU base-image/package follow-up. Please open a fresh issue or PR for that piece so it can be reviewed separately from the now-landed passthrough support.
Summary
Adds a generic GPU variant of the sandbox so agents can run CUDA workloads on an NVIDIA GPU from inside the sandbox. Gated on NEMOCLAW_GPU_MODE=1 so the default CPU path is unchanged. Note: this commit does not include a GPU base Docker file.
The goal is to allow GPU-enabled tools to be used by the agent in the sandbox, without having to stage the tool itself as an external service that the agent connects to. Running a tool service on the local host outside of the sandbox adds unnecessary complexity and overhead.
Changes
- `NEMOCLAW_GPU_MODE=1` opt-in env var that wires `--gpu` through `openshell gateway start` and `openshell sandbox create`, and selects a GPU sandbox base image (`ghcr.io/nvidia/nemoclaw/sandbox-base-gpu`); see the usage sketch after this list.
- GPU device paths (`/dev/nvidia*`) and `/proc` entries are appended at sandbox-create time by OpenShell's `enrich_proto_baseline_paths()`, so no GPU-specific policy file is needed.
- Expects a locally built GPU base image for now (`Dockerfile.base.gpu`, attached); TODO in code to relax the cache check once it gets published to GHCR.
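As a usage sketch of the first item: only `NEMOCLAW_GPU_MODE=1` and the `--gpu` flag come from this PR; the gateway name and port values below are illustrative, and the commands show the manual equivalent of what the onboarding code wires up.

```sh
# Opt in to GPU sandbox mode for the CLI session (env var from this PR).
export NEMOCLAW_GPU_MODE=1

# What onboarding then wires up underneath (name/port values are
# illustrative): both the gateway and the sandbox receive --gpu.
openshell gateway start --name my-gateway --port 8080 --gpu
openshell sandbox create --gpu
```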
Verification
- `npx prek run --all-files` passes
- `npm test` passes
- `make docs` builds without warnings (doc changes only)
Additionally tested this with cuOpt and successfully solved LP problems in the sandbox.
AI Disclosure
Signed-off-by: Trevor McKay tmgithub1@gmail.com