
feat(onboard): opt-in GPU sandbox mode via NEMOCLAW_GPU_MODE#2652

Closed
tmckayus wants to merge 1 commit into NVIDIA:main from tmckayus:tmckay/cuopt-gpu-sandbox-poc

Conversation


@tmckayus tmckayus commented Apr 29, 2026

Summary

Adds a generic GPU variant of the sandbox so agents can run CUDA workloads on an NVIDIA GPU from inside the sandbox. Gated on NEMOCLAW_GPU_MODE=1 so the default CPU path is unchanged. Note: this commit does not include a GPU base Docker file.

The goal is to allow GPU-enabled tools to be used by the agent in the sandbox, without having to stage the tool itself as an external service that the agent connects to. Running a tool service on the local host outside of the sandbox adds unnecessary complexity and overhead.

Changes

  • Adds NEMOCLAW_GPU_MODE=1 opt-in env var that wires --gpu through openshell gateway start and openshell sandbox create, and selects a GPU sandbox base image (ghcr.io/nvidia/nemoclaw/sandbox-base-gpu).
  • Sandbox policy is unchanged: GPU device nodes (/dev/nvidia*) and /proc rw are appended at sandbox-create time by OpenShell's enrich_proto_baseline_paths(), so no GPU-specific policy file is needed.
  • Adds preflight check that fails fast when GPU mode is set without an NVIDIA GPU or without the GPU base image cached locally (with build instructions in the error).
  • Test updates for the new helpers and parameterized base-image digest call.
  • No public-image dependency: GPU base image is currently local-only (build with Dockerfile.base.gpu, attached); TODO in code to relax the cache check once it gets published to GHCR.
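The gating, image-selection, and fail-fast behavior described above can be sketched as follows. This is an illustrative sketch only: the helper names (isGpuSandboxMode, getSandboxBaseImage) and the GPU image name come from this PR's summary, but the bodies and the preflight helper's shape are assumptions, not the actual onboard.ts code.

```typescript
// Hypothetical sketch of the opt-in GPU gate and base-image selection.
const SANDBOX_BASE_IMAGE = "ghcr.io/nvidia/nemoclaw/sandbox-base";
const SANDBOX_BASE_IMAGE_GPU = "ghcr.io/nvidia/nemoclaw/sandbox-base-gpu";

// GPU mode is an explicit opt-in: anything other than NEMOCLAW_GPU_MODE=1
// leaves the default CPU path unchanged.
function isGpuSandboxMode(
  env: Record<string, string | undefined> = process.env,
): boolean {
  return env.NEMOCLAW_GPU_MODE === "1";
}

function getSandboxBaseImage(gpuMode: boolean): string {
  return gpuMode ? SANDBOX_BASE_IMAGE_GPU : SANDBOX_BASE_IMAGE;
}

// Assumed shape of the preflight check: fail fast when GPU mode is set
// without a detected NVIDIA GPU, or without the GPU base image cached
// locally (including build instructions in the error, as the PR describes).
function assertGpuPreflight(gpuDetected: boolean, imageCached: boolean): void {
  if (!gpuDetected) {
    throw new Error("NEMOCLAW_GPU_MODE=1 is set but no NVIDIA GPU was detected");
  }
  if (!imageCached) {
    throw new Error(
      "GPU base image not cached locally; build it with " +
        "`docker build -f Dockerfile.base.gpu -t " +
        SANDBOX_BASE_IMAGE_GPU +
        ":latest .`",
    );
  }
}
```

In this sketch, startup code would call assertGpuPreflight only when isGpuSandboxMode() returns true, so the CPU path never pays for the check.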

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Additionally, I tested this with cuOpt and successfully solved LP problems in the sandbox.

AI Disclosure

  • [x] AI-assisted — tool: Cursor

Signed-off-by: Trevor McKay tmgithub1@gmail.com

Summary by CodeRabbit

  • New Features
    • GPU-enabled sandbox mode controlled via environment variable.
    • Automatic selection and pinning of CPU/GPU base images with fallback behavior.
    • Sandbox creation and gateway startup now conditionally request GPU allocation.
    • Preflight checks enforce presence of required GPU base image and provide build guidance if missing.


coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

Adds GPU-enabled sandbox mode driven by NEMOCLAW_GPU_MODE, introduces GPU-specific base image/tag constants and helpers, updates base-image digest resolution and Dockerfile ARG patching for CPU/GPU, enforces preflight GPU cache checks, and conditions gateway/sandbox startup to forward --gpu only in GPU mode.

Changes

Cohort / File(s) Summary
GPU Sandbox Support
src/lib/onboard.ts
Introduce GPU sandbox mode (NEMOCLAW_GPU_MODE); add SANDBOX_BASE_IMAGE_GPU, SANDBOX_BASE_TAG_GPU, isGpuSandboxMode, getSandboxBaseImage, getSandboxBaseTag; update pullAndResolveBaseImageDigest({ gpuMode?: boolean } = {}) to accept gpuMode; pin GPU/CPU base image digests; rewrite ARG BASE_IMAGE for GPU when requested; export new constants/helpers. Attention: digest resolution, Dockerfile ARG patching, and exported API signature changed.
Preflight & Startup Logic
src/lib/onboard.ts (same file cohort)
Add strict local-cache preflight check for GPU base image when GPU mode enabled (throws with build instructions if missing); conditionally forward --gpu to gateway and sandbox creation only when GPU mode is active; fail-fast when GPU mode requested but no NVIDIA GPU detected.
Tests
test/onboard.test.ts
Relax test matcher: locate pullAndResolveBaseImageDigest( via regex search instead of exact token match; update assertion message accordingly.
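The test-matcher relaxation can be illustrated with a small sketch. The source string and the old/new matcher shapes below are hypothetical stand-ins; the real assertion lives in test/onboard.test.ts:

```typescript
// A call site like the one the test scans for. After this PR, the function
// takes an options object, so its argument list is no longer empty.
const source =
  "const digest = await pullAndResolveBaseImageDigest({ gpuMode: true });";

// Old style (hypothetical): an exact-token match on the zero-arg call is
// brittle once the signature gains arguments.
const exactMatch = source.includes("pullAndResolveBaseImageDigest()");

// New style: a regex locates the call regardless of its argument list,
// which is the relaxation the walkthrough describes.
const callSite = /pullAndResolveBaseImageDigest\(/.test(source);
```

With the relaxed matcher, extending the signature again later will not break the test.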

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer/CLI
    participant GW as Gateway (startup)
    participant Onboard as onboard.ts
    participant Cache as Local Cache / Docker
    participant Registry as GHCR/Registry
    participant Plugin as NVIDIA Device Plugin

    Dev->>GW: start sandbox with NEMOCLAW_GPU_MODE
    GW->>Onboard: initialize sandbox (gpuMode inferred)
    Onboard->>Plugin: detect GPU presence
    alt GPU present
        Onboard->>Cache: check for GPU base image digest (preflight)
        alt cached
            Onboard->>Onboard: resolve/pin GPU base image digest
        else not cached
            Onboard->>Registry: attempt GHCR digest lookup
            alt resolved
                Onboard->>Onboard: pin GPU image digest
            else failed
                Onboard-->>Dev: throw with build instructions (fail-fast)
            end
        end
        Onboard->>GW: start gateway with `--gpu`
        GW->>Plugin: device plugin allocates GPU
    else no GPU
        Onboard-->>Dev: fail (GPU mode requested but no GPU)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped in code to find a spark,
CUDA dreams lit up the dark,
Base images pinned, ARGs all neat,
GPU flags make startup fleet—
A rabbit cheers for sandboxed art!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 62.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Title check: ✅ Passed. The title accurately summarizes the main change: introducing an opt-in GPU sandbox mode via the NEMOCLAW_GPU_MODE environment variable, which is the primary feature added across both modified files.





tmckayus commented Apr 29, 2026

There may be a better approach than using an env var, but this was non-invasive to the CLI and good for a POC. Likewise for the PR -- there may be a better way, but this solves the essential issue for us: getting cuopt into the hands of the agent as a local tool, without the complexity and overhead of cuopt as an external endpoint.

As written, the PR expects a locally built GPU base image, named and tagged appropriately (see below).


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2684-2699: The recovery path doesn't include the GPU flag so
recovered gateways lose GPU passthrough; update recoverGatewayRuntime (the
function that restarts the gateway) to append "--gpu" to the gwArgs it builds
when isGpuSandboxMode(_gpu) is true (same logic used when constructing gwArgs
for startGatewayWithOptions), ensuring any place that constructs gateway args
for recovery (also the similar block around the 2950-2957 region) uses the
gwArgs variable and checks isGpuSandboxMode(_gpu) before pushing "--gpu".
- Around line 2572-2576: The current check uses runCapture(..., { ignoreError:
true }) which returns stderr text on failure, causing variables like cached and
localCheck to be truthy even when the image is missing; change the logic to test
the command's exit status instead of its captured output: call the docker
inspection using a variant that returns/throws on non-zero exit (e.g., run(...)
or runCapture but inspect its exitCode) and treat any non-zero exit as "image
not present", then set cached/localCheck to true only when the command succeeds
(exit code === 0); update both occurrences that use runCapture for "docker image
inspect" so the GPU fast-fail and local-cache fallback behave correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d8d34d89-2922-4635-af6d-b31f65476a4e

📥 Commits

Reviewing files that changed from the base of the PR and between 7186834 and 27d5d12.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts
Comment on lines 2684 to +2699
   const gwArgs = ["--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)];
-  // Do NOT pass --gpu here. On DGX Spark (and most GPU hosts), inference is
-  // routed through a host-side provider (Ollama, vLLM, or cloud API) — the
-  // sandbox itself does not need direct GPU access. Passing --gpu causes
+  // Default: do NOT pass --gpu. On DGX Spark (and most GPU hosts), inference
+  // is routed through a host-side provider (Ollama, vLLM, or cloud API) —
+  // the sandbox itself does not need direct GPU access. Passing --gpu causes
   // FailedPrecondition errors when the gateway's k3s device plugin cannot
   // allocate GPUs. See: https://build.nvidia.com/spark/nemoclaw/instructions
+  //
+  // Exception: NEMOCLAW_GPU_MODE=1 opts the sandbox into GPU passthrough
+  // for CUDA workloads that need direct device access. This adds --gpu
+  // to the gateway so the k3s NVIDIA device plugin can satisfy
+  // `resources.limits.nvidia.com/gpu: 1`. Requires: CDI spec on the host
+  // (`nvidia-ctk cdi generate`), NVIDIA Container Toolkit configured as
+  // default runtime, and a GPU-capable host.
+  if (isGpuSandboxMode(_gpu)) {
+    gwArgs.push("--gpu");
+  }


⚠️ Potential issue | 🟠 Major

Carry the new --gpu flag into recovery gateway starts as well.

This change wires --gpu into startGatewayWithOptions(), but recoverGatewayRuntime() still restarts the gateway without it. In GPU mode, a recovered gateway will come back without GPU allocation support, so later GPU sandbox operations can regress after a restart.

Suggested fix
-  const startResult = runOpenshell(
-    ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)],
+  const recoveryArgs = ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)];
+  if (isGpuSandboxMode(nim.detectGpu())) {
+    recoveryArgs.push("--gpu");
+  }
+  const startResult = runOpenshell(
+    recoveryArgs,
     {
       ignoreError: true,
       env: getGatewayStartEnv(),

Also applies to: 2950-2957

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2684 - 2699, The recovery path doesn't
include the GPU flag so recovered gateways lose GPU passthrough; update
recoverGatewayRuntime (the function that restarts the gateway) to append "--gpu"
to the gwArgs it builds when isGpuSandboxMode(_gpu) is true (same logic used
when constructing gwArgs for startGatewayWithOptions), ensuring any place that
constructs gateway args for recovery (also the similar block around the
2950-2957 region) uses the gwArgs variable and checks isGpuSandboxMode(_gpu)
before pushing "--gpu".

@tmckayus

This is the Dockerfile.base.gpu that I used to build a generic GPU base image, plus a short doc on how to use it. The command I used to build the GPU base image is:

DOCKER_BUILDKIT=1 docker build --pull \
  -f Dockerfile.base.gpu \
  -t ghcr.io/nvidia/nemoclaw/sandbox-base-gpu:latest .

GPU-QUICKSTART.md

dockerfile.zip

@wscurran wscurran added enhancement New feature or request NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). labels Apr 29, 2026
@wscurran

✨ Thanks for submitting this PR that adds a generic GPU variant of the sandbox, allowing agents to run CUDA workloads on an NVIDIA GPU from inside the sandbox. This change introduces an opt-in environment variable, NEMOCLAW_GPU_MODE, which enables the GPU sandbox mode.

Adds a generic GPU variant of the sandbox so agents can run CUDA
workloads on an NVIDIA GPU from inside the sandbox. Gated on
NEMOCLAW_GPU_MODE=1 so the default CPU path is unchanged. Note: this
commit does not include a GPU base Docker file.

Signed-off-by: Trevor McKay <tmgithub1@gmail.com>
@tmckayus tmckayus force-pushed the tmckay/cuopt-gpu-sandbox-poc branch from 27d5d12 to b46b686 on April 29, 2026 at 19:24

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
src/lib/onboard.ts (1)

3015-3029: ⚠️ Potential issue | 🟠 Major

Recovery restart still drops GPU mode after gateway recovery.

startGatewayWithOptions() correctly appends --gpu (Lines 3027-3029), but recovery restart still uses fixed args without it (Line 3281). In GPU mode, a recovered gateway can come back without GPU allocation support.

Suggested fix
 async function recoverGatewayRuntime() {
   runOpenshell(["gateway", "select", GATEWAY_NAME], { ignoreError: true });
   let status = runCaptureOpenshell(["status"], { ignoreError: true });
   if (status.includes("Connected") && isSelectedGateway(status)) {
     process.env.OPENSHELL_GATEWAY = GATEWAY_NAME;
     return true;
   }
+  const gpu = nim.detectGpu();
+  const recoveryArgs = ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)];
+  if (isGpuSandboxMode(gpu)) {
+    recoveryArgs.push("--gpu");
+  }
 
   const startResult = runOpenshell(
-    ["gateway", "start", "--name", GATEWAY_NAME, "--port", String(GATEWAY_PORT)],
+    recoveryArgs,
     {
       ignoreError: true,
       env: getGatewayStartEnv(),
       suppressOutput: true,
     },
   );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 3015 - 3029, The recovery restart path
builds gateway args without honoring GPU mode causing recovered gateways to drop
GPU passthrough; update the recovery restart logic to mirror
startGatewayWithOptions by checking isGpuSandboxMode(_gpu) and appending "--gpu"
to the gwArgs used during recovery restart (the same symbol gwArgs and
isGpuSandboxMode should be used), so the reboot/recovery routine constructs the
same arg list as startGatewayWithOptions and preserves GPU passthrough on
restart.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28439c5b-2950-49cb-82b2-2ef68c9d0b31

📥 Commits

Reviewing files that changed from the base of the PR and between 27d5d12 and b46b686.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/onboard.test.ts
✅ Files skipped from review due to trivial changes (1)
  • test/onboard.test.ts

@tmckayus tmckayus changed the title Add opt-in GPU sandbox mode via NEMOCLAW_GPU_MODE env var feat(onboard): opt-in GPU sandbox mode via NEMOCLAW_GPU_MODE Apr 29, 2026
@prekshivyas

Thanks for the groundwork here, @tmckay — this validated that GPU passthrough through OpenShell works end-to-end with CUDA workloads (cuOpt).

#2799 now auto-enables GPU passthrough when an NVIDIA GPU is detected on the host — no env var needed. nvidia-smi works inside the standard sandbox image with just device passthrough.

That should cover the core use case. Does the sandbox-base-gpu image in this PR include additional packages (CUDA toolkit, cuDNN, etc.) beyond what passthrough provides? If so, that could be a separate follow-up for custom base images.

@prekshivyas prekshivyas self-assigned this May 1, 2026

tmckayus commented May 8, 2026

> Thanks for the groundwork here, @tmckay — this validated that GPU passthrough through OpenShell works end-to-end with CUDA workloads (cuOpt).
>
> #2799 now auto-enables GPU passthrough when an NVIDIA GPU is detected on the host — no env var needed. nvidia-smi works inside the standard sandbox image with just device passthrough.
>
> That should cover the core use case. Does the sandbox-base-gpu image in this PR include additional packages (CUDA toolkit, cuDNN, etc.) beyond what passthrough provides? If so, that could be a separate follow-up for custom base images.

Hi @prekshivyas, sorry for the delay (other issues :)). Yes, the base image I used included CUDA libraries plus cuOpt. I saw the updated nemoclaw with the --gpu flag support and gave it a try. I was able to run nvidia-smi as promised, awesome! When I tried to install and use cuOpt in the sandbox, however, I ran into some issues. It could be my GPU: I'm running on a brand-new RTX Blackwell, so something a bit older might yield better results. I'll try again and see where I get to. Thanks!


wscurran commented May 8, 2026

Closing as superseded by #2799, which merged the supported nemoclaw onboard --gpu path for GPU passthrough and covers the core GPU sandbox mode without NEMOCLAW_GPU_MODE. Thanks for validating the original cuOpt/GPU workflow here, @tmckayus.

The remaining gap you called out around CUDA/cuOpt libraries in the sandbox image sounds like a narrower custom GPU base-image/package follow-up. Please open a fresh issue or PR for that piece so it can be reviewed separately from the now-landed passthrough support.

@wscurran wscurran closed this May 8, 2026