Skip to content

feat(onboard): use OpenShell Docker GPU sandboxes#3001

Merged
ericksoa merged 72 commits into
mainfrom
codex/openshell-docker-gpu-onboard
May 11, 2026
Merged

feat(onboard): use OpenShell Docker GPU sandboxes#3001
ericksoa merged 72 commits into
mainfrom
codex/openshell-docker-gpu-onboard

Conversation

@ericksoa
Copy link
Copy Markdown
Contributor

@ericksoa ericksoa commented May 4, 2026

Summary

  • route Linux onboarding through the OpenShell Docker-driver gateway instead of the legacy k3s bootstrap path
  • add sandbox GPU env/CLI controls, Docker/NVIDIA CDI preflight, GPU metadata, and direct-GPU policy/proof checks
  • update the GPU e2e path to prove direct sandbox GPU access with nvidia-smi, /proc comm write, and cuInit(0)

Tests

  • bash -n scripts/install-openshell.sh && bash -n test/e2e/test-gpu-e2e.sh
  • npm run build:cli
  • npm run typecheck:cli
  • npx vitest run test/onboard.test.ts src/lib/onboard-command.test.ts test/install-openshell-version-check.test.ts src/lib/inventory-commands.test.ts
  • npm run check:credential-env
  • git diff --check

Summary by CodeRabbit

  • New Features

    • Sandbox GPU controls in onboarding/CLI (enable/disable + device) and OpenShell release-channel selection (stable/dev/auto).
  • Improvements

    • Inventory/status now shows host GPU detection, sandbox GPU enabled/mode/device, and OpenShell driver/version.
    • Installer now downloads/verifies multiple OpenShell assets and reports installed OpenShell version.
    • Docker-driver support for privileged sandbox operations and gateway lifecycle handling.
    • Base-image resolution improved with GLIBC compatibility checks and safer fallback/build behavior.
  • Tests

    • Added/updated GPU e2e checks and broad test coverage for onboarding, CLI, provisioning, and image-resolution.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds sandbox GPU support and OpenShell channel-aware installation; integrates a Linux Docker-driver gateway flow; persists and surfaces GPU/OpenShell metadata in registry, status, and inventory; implements a GLIBC-aware sandbox base-image resolver for CI; converts OpenShell installer to multi-asset release downloads; and updates tests including e2e GPU verification.

Changes

Sandbox GPU, Docker-Driver Gateway, and OpenShell Channel

Layer / File(s) Summary
Data Shape
src/lib/registry.ts, src/lib/inventory-commands.ts
Added GPU/OpenShell fields to sandbox types: hostGpuDetected, sandboxGpuEnabled, sandboxGpuMode, sandboxGpuDevice, openshellDriver, openshellVersion; updated SandboxInventoryRow/StatusSandboxRow to include these.
CLI Options / Parsing
src/lib/onboard-command.ts, src/lib/onboard-command.test.ts
Introduced --sandbox-gpu / --no-sandbox-gpu / --sandbox-gpu-device parsing, validation, mutual-exclusion checks, and tests; OnboardCommandOptions gains sandboxGpu and sandboxGpuDevice.
Core GPU Resolution & Policy
src/lib/onboard.ts, test/onboard.test.ts
Added GPU config resolution (resolveSandboxGpuConfig), helpers to build --gpu args, direct-GPU policy YAML, proof command generation, and preflight validation (Docker CDI detection).
Sandbox Create, Reuse & Drift
src/lib/onboard.ts, src/lib/registry.ts
createSandbox(..., sandboxGpuConfig) threaded through onboarding; injects GPU create args, records GPU fields in registry, verifies direct-GPU access post-create and deletes on proof failure, and treats GPU drift as a reuse-breaker.
Linux Docker-Driver Gateway Integration
src/lib/onboard.ts, test/onboard.test.ts
New Docker-driver gateway helpers (start, PID/liveness, endpoint registration, env wiring); gateway start/health/reuse routed through Docker-driver paths; skips legacy k3s container reuse and bypasses DNS-proxy setup in docker-driver mode.
Inventory / Status / CLI Output
src/lib/inventory-commands.ts, src/lib/sandbox-status-action.ts, src/lib/inventory-commands.test.ts, test/cli.test.ts
Inventory and status rows include GPU/OpenShell metadata; textual GPU label changes to sandbox GPU / CPU sandbox; JSON and textual tests updated accordingly.
OpenShell Channel & Multi-asset Installer
scripts/install-openshell.sh, src/lib/onboard.ts, test/install-openshell-version-check.test.ts, test/runner.test.ts
Adds NEMOCLAW_OPENSHELL_CHANNEL (`stable
Sandbox Base Image Resolution & CI
src/lib/sandbox-base-image.ts, src/lib/sandbox-base-image.test.ts, .github/actions/*, .github/workflows/*, src/lib/agent-onboard.ts
New resolver prefers SHA-tagged GHCR images, validates GLIBC via ldd against a minimum, falls back to local build when allowed; CI composites/workflows updated to use resolver; agent onboarding writes resolved base ref into staged Dockerfile.
Tests / E2E Validation
test/onboard.test.ts, test/install-openshell-version-check.test.ts, test/e2e/test-gpu-e2e.sh, various tests
Extended unit tests for new helpers and installer logic; e2e adds sandbox GPU verification (status, nvidia-smi, /proc write, CUDA cuInit(0)); multiple tests force process.platform="darwin" for determinism; checksum download stubs adjusted.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as Onboard CLI
    participant Core as Onboard Core
    participant GPU as GPU Resolver
    participant Openshell as OpenShell (gateway/sandbox)
    participant Registry

    User->>CLI: nemoclaw onboard --sandbox-gpu --sandbox-gpu-device 0
    CLI->>CLI: parse flags (--sandbox-gpu, --sandbox-gpu-device)
    CLI->>Core: runOnboard({ sandboxGpu:"enable", sandboxGpuDevice:"0", ... })
    Core->>GPU: resolveSandboxGpuConfig(detectedGpu, { flag:"enable", device:"0" })
    GPU-->>Core: SandboxGpuConfig { sandboxGpuEnabled:true, sandboxGpuMode:..., sandboxGpuDevice:"0" }
    Core->>Openshell: prepare direct-GPU policy & create sandbox (--gpu --gpu-device 0)
    Openshell->>Openshell: create sandbox, mount GPU device
    Core->>Openshell: verifyDirectSandboxGpu(sandbox)
    Openshell-->>Core: proof result (nvidia-smi, /proc write, cuInit)
    Core->>Registry: registerSandbox(..., gpu fields, openshell version/driver)
    Core-->>User: onboarding complete (sandbox with GPU)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I dug a tunnel, found a gleam,
OpenShell channels, GPU dream,
Gateways hum and policies bake,
CUDA wakes and proofs awake,
A happy rabbit hops—sandbox supreme!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.81% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main objective: adding OpenShell Docker GPU sandboxes to the onboarding flow, which is the primary focus across the substantial changes to onboard.ts, GPU-related CLI/inventory fields, and e2e GPU verification.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/openshell-docker-gpu-onboard

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/onboard.test.ts (1)

151-169: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen isOnboardTestInternals for newly required helpers.

Line 151 validates only the legacy helper subset, but newly required helpers (for example those added at Lines 186-205 and used at Lines 243-325) are now assumed callable. If one export is missing, the guard still passes and failures become late ... is not a function errors.

🔧 Suggested patch
 function isOnboardTestInternals(
   value: OnboardTestInternalsCandidate,
 ): value is OnboardTestInternals {
   return (
     value !== null &&
@@
     typeof value.buildSandboxConfigSyncScript === "function" &&
+    typeof value.buildSandboxGpuCreateArgs === "function" &&
+    typeof value.buildDirectGpuPolicyYaml === "function" &&
+    typeof value.buildDirectSandboxGpuProofCommands === "function" &&
@@
     typeof value.getBlueprintMinOpenshellVersion === "function" &&
     typeof value.getBlueprintMaxOpenshellVersion === "function" &&
+    typeof value.getDockerDriverGatewayEnv === "function" &&
+    typeof value.isLinuxDockerDriverGatewayEnabled === "function" &&
+    typeof value.parseDockerCdiSpecDirs === "function" &&
+    typeof value.resolveSandboxGpuConfig === "function" &&
+    typeof value.shouldAllowOpenshellAboveBlueprintMax === "function" &&
@@
     typeof value.shouldRunCompatibleEndpointSandboxSmoke === "function" &&
     typeof value.writeSandboxConfigSyncFile === "function"
   );
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard.test.ts` around lines 151 - 169, The type-guard
isOnboardTestInternals only checks the legacy helper subset and must be extended
to include typeof checks for every newly required helper export used later in
the tests; update the isOnboardTestInternals function to assert that each new
helper (the additional functions referenced by the tests beyond the existing
list) exists and is a function so the guard fails early rather than producing
late "is not a function" errors. Locate isOnboardTestInternals and add typeof
checks for each of the new helper symbols introduced elsewhere in the test suite
so the returned boolean accurately reflects the full required API surface.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4068-4077: The restart path currently kills any live PID returned
by getDockerDriverGatewayPid() which can accidentally terminate unrelated
processes if the PID was recycled; modify the logic so after confirming
isPidAlive(existingPid) you validate that the live process belongs to the
gateway (e.g., inspect the process command line/executable via
/proc/<pid>/cmdline or a ps call and ensure it contains "openshell-gateway" or
the known gateway binary path), and only then call process.kill(existingPid,
"SIGTERM"); if the check fails, remove/clear the stale pid file (the same PID
file read by getDockerDriverGatewayPid()) and skip the kill attempt so unrelated
processes are not affected.
- Around line 1393-1408: verifyDirectSandboxGpu currently calls
process.exit(...) on failure which kills the process and leaves an orphaned
OpenShell sandbox; instead make failures propagate to the caller (createSandbox)
so it can clean up. Change verifyDirectSandboxGpu to stop calling process.exit
and instead throw an Error (or return a rejected Promise/false if you prefer
async patterns) containing the proof label, status and diagnostic info (use
result.status, proof.label and the compactText(redact(...)) diagnostic), so
createSandbox() can catch it, delete the sandbox, and then fail cleanly; keep
the existing logging (console.error lines and remediation lines) but remove
process.exit and replace it with the thrown/rejected failure.

In `@src/lib/sandbox-status-action.ts`:
- Around line 72-79: The status rendering treats sb.sandboxGpuEnabled as
absent/false and prints "disabled" for older entries that only set
sb.gpuEnabled; update the logic that computes sandboxGpu (currently using
sb.sandboxGpuEnabled) to fallback to sb.gpuEnabled when sandboxGpuEnabled is
undefined (i.e., use sandboxGpuEnabled ?? sb.gpuEnabled or explicit undefined
check) so legacy entries with gpuEnabled: true render as enabled; keep the rest
of the sandboxGpuMode/sandboxGpuDevice logic unchanged (refer to
sb.sandboxGpuEnabled, sb.gpuEnabled, and the sandboxGpu variable).

---

Outside diff comments:
In `@test/onboard.test.ts`:
- Around line 151-169: The type-guard isOnboardTestInternals only checks the
legacy helper subset and must be extended to include typeof checks for every
newly required helper export used later in the tests; update the
isOnboardTestInternals function to assert that each new helper (the additional
functions referenced by the tests beyond the existing list) exists and is a
function so the guard fails early rather than producing late "is not a function"
errors. Locate isOnboardTestInternals and add typeof checks for each of the new
helper symbols introduced elsewhere in the test suite so the returned boolean
accurately reflects the full required API surface.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d605cafe-ac24-43a5-aa80-fbfda7e5c795

📥 Commits

Reviewing files that changed from the base of the PR and between 7d5cdd4 and df4f0e8.

📒 Files selected for processing (11)
  • scripts/install-openshell.sh
  • src/lib/inventory-commands.test.ts
  • src/lib/inventory-commands.ts
  • src/lib/onboard-command.test.ts
  • src/lib/onboard-command.ts
  • src/lib/onboard.ts
  • src/lib/registry.ts
  • src/lib/sandbox-status-action.ts
  • test/e2e/test-gpu-e2e.sh
  • test/install-openshell-version-check.test.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts
Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/actions/sandbox/status.ts
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25340341019
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 0 passed, 21 failed, 1 skipped

Job Result
cloud-e2e ❌ failure
cloud-inference-e2e ❌ failure
cloud-onboard-e2e ❌ failure
deployment-services-e2e ❌ failure
diagnostics-e2e ❌ failure
docs-validation-e2e ❌ failure
gpu-e2e ⏭️ skipped
hermes-e2e ❌ failure
inference-routing-e2e ❌ failure
messaging-compatible-endpoint-e2e ❌ failure
messaging-providers-e2e ❌ failure
network-policy-e2e ❌ failure
overlayfs-autofix-e2e ❌ failure
rebuild-hermes-e2e ❌ failure
rebuild-openclaw-e2e ❌ failure
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ❌ failure
shields-config-e2e ❌ failure
skill-agent-e2e ❌ failure
snapshot-commands-e2e ❌ failure
token-rotation-e2e ❌ failure
upgrade-stale-sandbox-e2e ❌ failure

Failed jobs: cloud-e2e, cloud-onboard-e2e, cloud-inference-e2e, skill-agent-e2e, docs-validation-e2e, messaging-providers-e2e, messaging-compatible-endpoint-e2e, token-rotation-e2e, sandbox-survival-e2e, hermes-e2e, sandbox-operations-e2e, inference-routing-e2e, network-policy-e2e, deployment-services-e2e, diagnostics-e2e, snapshot-commands-e2e, shields-config-e2e, rebuild-openclaw-e2e, upgrade-stale-sandbox-e2e, rebuild-hermes-e2e, overlayfs-autofix-e2e. Check run artifacts for logs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/lib/onboard.ts (1)

3067-3070: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the PID file still belongs to openshell-gateway.

isDockerDriverGatewayProcessAlive() only checks whether the recorded PID exists. If openshell-gateway.pid is stale and that PID has been recycled, the Linux reuse path can accept an unrelated process as “alive” and skip gateway recreation. Reuse should require both liveness and isDockerDriverGatewayProcess(...), and clear the pid file when identity no longer matches.

Suggested direction
 function isDockerDriverGatewayProcessAlive(): boolean {
   const pid = getDockerDriverGatewayPid();
-  return pid !== null && isPidAlive(pid);
+  if (pid === null || !isPidAlive(pid)) return false;
+  const gatewayBin = resolveOpenShellGatewayBinary();
+  if (!isDockerDriverGatewayProcess(pid, gatewayBin)) {
+    fs.rmSync(getDockerDriverGatewayPidFile(), { force: true });
+    return false;
+  }
+  return true;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 3067 - 3070, Update
isDockerDriverGatewayProcessAlive to not only check PID liveness but also verify
process identity using isDockerDriverGatewayProcess(pid); call
getDockerDriverGatewayPid(), if pid is null return false, then if
isPidAlive(pid) is true but isDockerDriverGatewayProcess(pid) is false,
delete/clear the openshell-gateway PID file (use the same mechanism that writes
the PID) and return false; only return true when both isPidAlive(pid) and
isDockerDriverGatewayProcess(pid) are true. Reference functions:
isDockerDriverGatewayProcessAlive, getDockerDriverGatewayPid, isPidAlive,
isDockerDriverGatewayProcess.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3104-3115: The reuse path must only claim gateway reuse if the
endpoint was actually registered: update registerDockerDriverGatewayEndpoint()
so it sets process.env.OPENSHELL_GATEWAY only when both runOpenshell calls
succeeded (addResult.status === 0 && selectResult.status === 0) and return that
boolean; then change the reuse/early-return logic (the places around the
liveness/health check that currently return without checking
registerDockerDriverGatewayEndpoint) to require
registerDockerDriverGatewayEndpoint() to return true before announcing reuse or
exiting early. Use the existing symbols runOpenshell, addResult, selectResult,
GATEWAY_NAME and OPENSHELL_GATEWAY to locate and change the code.
- Around line 919-928: The current check only verifies that CDI directories
exist; instead, after calling getDockerCdiSpecDirs() iterate those directories
and search for at least one readable CDI spec file (match extensions .json or
.yaml/.yml) using fs.readdir/fs.stat (or their sync variants) and handle
permission errors; if no spec files are found, print the same remediation
messages (using sandboxGpuRemediationLines()) and exit(1); if spec files are
found, log the detected spec file paths in the success message (replace
cdiSpecDirs.join(...) with the actual spec file paths) so that the check ensures
Docker has real NVIDIA CDI spec files before declaring GPU support.

In `@test/runner.test.ts`:
- Around line 672-677: The stub that handles
openshell-gateway-checksums-sha256.txt and
openshell-sandbox-checksums-sha256.txt only emits x86_64 asset lines, so add
corresponding aarch64 entries to the printf output for both cases (i.e., emit
lines for 'openshell-gateway-aarch64-unknown-linux-gnu.tar.gz' and
'openshell-sandbox-aarch64-unknown-linux-gnu.tar.gz' in the same format as the
existing 'ignored  openshell-*-x86_64-unknown-linux-gnu.tar.gz' lines) so the
grep-based checksum matching in the installer will find aarch64 entries when
running on Linux aarch64.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3067-3070: Update isDockerDriverGatewayProcessAlive to not only
check PID liveness but also verify process identity using
isDockerDriverGatewayProcess(pid); call getDockerDriverGatewayPid(), if pid is
null return false, then if isPidAlive(pid) is true but
isDockerDriverGatewayProcess(pid) is false, delete/clear the openshell-gateway
PID file (use the same mechanism that writes the PID) and return false; only
return true when both isPidAlive(pid) and isDockerDriverGatewayProcess(pid) are
true. Reference functions: isDockerDriverGatewayProcessAlive,
getDockerDriverGatewayPid, isPidAlive, isDockerDriverGatewayProcess.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c00cc609-0764-4942-9057-fbbbe157098f

📥 Commits

Reviewing files that changed from the base of the PR and between df4f0e8 and 74fe6eb.

📒 Files selected for processing (5)
  • src/lib/onboard.ts
  • src/lib/sandbox-status-action.ts
  • test/cli.test.ts
  • test/onboard.test.ts
  • test/runner.test.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/sandbox-status-action.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts
Comment thread test/runner.test.ts
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (2)
src/lib/onboard.ts (2)

3091-3101: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only mark the Docker-driver gateway reusable after add/select succeeds.

This helper still exports OPENSHELL_GATEWAY even when gateway add/select failed, and the reuse branch still returns without checking the boolean. That can make onboarding announce reuse while later commands target an unregistered or wrong gateway.

Suggested fix
 function registerDockerDriverGatewayEndpoint(): boolean {
   const addResult = runOpenshell(
     ["gateway", "add", "--local", "--name", GATEWAY_NAME, getDockerDriverGatewayEndpoint()],
     { ignoreError: true, suppressOutput: true },
   );
   const selectResult = runOpenshell(["gateway", "select", GATEWAY_NAME], {
     ignoreError: true,
     suppressOutput: true,
   });
-  process.env.OPENSHELL_GATEWAY = GATEWAY_NAME;
-  return addResult.status === 0 && selectResult.status === 0;
+  const ok = addResult.status === 0 && selectResult.status === 0;
+  if (ok) process.env.OPENSHELL_GATEWAY = GATEWAY_NAME;
+  return ok;
 }
   if (
     isDockerDriverGatewayProcessAlive() &&
     isGatewayHealthy(gatewayStatus, gwInfo, activeGatewayInfo)
   ) {
-    console.log("  ✓ Reusing existing Docker-driver gateway");
-    registerDockerDriverGatewayEndpoint();
-    return;
+    if (registerDockerDriverGatewayEndpoint()) {
+      console.log("  ✓ Reusing existing Docker-driver gateway");
+      return;
+    }
   }

Also applies to: 4060-4066

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 3091 - 3101, The
registerDockerDriverGatewayEndpoint function sets process.env.OPENSHELL_GATEWAY
and reports success even when runOpenshell add/select fail; change it to only
set OPENSHELL_GATEWAY and return true when both addResult.status === 0 and
selectResult.status === 0, otherwise return false (do not export
OPENSHELL_GATEWAY on failure). Also update the reuse/early-return branch that
announces reuse to check that this boolean success is true before claiming reuse
(apply same change to the analogous implementation around the other occurrence
noted).

925-935: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require a real NVIDIA CDI spec before passing GPU preflight.

Checking only .CDISpecDirs still lets empty CDI directories pass here, so onboarding can report GPU support and then fail later when sandbox create --gpu has no NVIDIA spec to use. Validate that at least one readable NVIDIA .json/.yaml spec exists in those directories and log the resolved spec file paths instead of the directory list.

Suggested fix
   const cdiSpecDirs = getDockerCdiSpecDirs();
-  if (cdiSpecDirs.length === 0) {
+  const cdiSpecFiles = cdiSpecDirs.flatMap((dir) => {
+    try {
+      return fs
+        .readdirSync(dir)
+        .filter((name) => /^nvidia.*\.(json|ya?ml)$/i.test(name))
+        .map((name) => path.join(dir, name));
+    } catch {
+      return [];
+    }
+  });
+  if (cdiSpecFiles.length === 0) {
     console.error("");
     console.error("  ✗ Docker CDI GPU support was not detected.");
     for (const line of sandboxGpuRemediationLines()) {
       console.error(`    ${line}`);
     }
     process.exit(1);
   }
-  console.log(`  ✓ Docker CDI GPU support detected (${cdiSpecDirs.join(", ")})`);
+  console.log(`  ✓ Docker CDI GPU support detected (${cdiSpecFiles.join(", ")})`);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 925 - 935, The current check uses
getDockerCdiSpecDirs() and treats any non-empty directory list as success;
update the code to scan each directory in cdiSpecDirs for readable NVIDIA CDI
spec files (extensions .json, .yaml, .yml), attempting to access and minimally
validate each file (e.g., readable and contains expected NVIDIA/vendor or
runtime fields) and collect their full file paths (suggest adding a helper like
findReadableCdiSpecs or reuse getDockerCdiSpecDirs to return files). If no valid
NVIDIA spec files are found, call sandboxGpuRemediationLines() and
process.exit(1) as before; if found, log the resolved spec file paths (not just
directories) in the success message and use those paths downstream (reference
cdiSpecDirs, sandboxGpuRemediationLines, and the success log).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/actions/resolve-sandbox-base-image/action.yaml:
- Around line 55-57: After building the local fallback image with "docker build
-f Dockerfile.base -t nemoclaw-sandbox-base-local .", run the same glibc_version
ABI check inside that newly built image (e.g., docker run --rm
nemoclaw-sandbox-base-local sh -c '<glibc_version check command>') and only
export BASE_IMAGE=nemoclaw-sandbox-base-local to GITHUB_ENV if the check
succeeds; if the check fails, emit a failure/warning and exit non-zero so
downstream steps don't consume an incompatible base image. Ensure you reference
the built image tag "nemoclaw-sandbox-base-local", the Dockerfile
"Dockerfile.base" and the "BASE_IMAGE" env var in your fix.

In `@src/lib/agent-onboard.ts`:
- Around line 82-99: The code currently treats a null return from
resolveSandboxBaseImage() as if a compatible image exists by inspecting/building
`${baseImageName}:latest` and assigning baseImageRef = baseImageTag; instead,
update the branch handling when resolved is falsy so that
resolveSandboxBaseImage() failure is not silently bypassed: check the platform
(e.g., process.platform) and if on Linux, throw or return an explicit error
(including agent.displayName and baseImageName) instead of inspecting/building
:latest; for non-Linux platforms you may keep the existing
dockerImageInspect/dockerBuild flow but only after documenting/validating that
using baseImageTag is safe. Ensure you modify the code paths that set
baseImageRef, call dockerImageInspect, and call dockerBuild so a resolver
failure does not automatically set baseImageRef = baseImageTag on Linux.

In `@src/lib/onboard.ts`:
- Around line 1382-1383: The GPU-proof loop can hang; add a per-proof timeout
(e.g. 10_000 ms) around each runOpenshell invocation so a wedged probe will be
aborted and treated as failure; implement a small helper (e.g.
runOpenshellWithTimeout) that calls
buildDirectSandboxGpuProofCommands(sandboxName) entries and for each proof
starts runOpenshell(proof.args, { ignoreError: true, suppressOutput: true }),
sets a timer to kill/abort the underlying process (or call whatever runOpenshell
exposes to terminate) if it exceeds the timeout, clears the timer on completion,
and returns a timeout error result so the existing delete-and-fail path can
proceed. Ensure the timeout is per-proof and that the helper preserves the
original flags and error semantics.

In `@src/lib/sandbox-base-image.ts`:
- Around line 139-140: resolvePulledCandidate currently always attempts
dockerPull and treats non-pullable refs as failures even if the image exists
locally; change the logic in resolvePulledCandidate to first check the local
Docker cache for imageRef (use the existing docker inspect/inspectImage or
dockerImageExists helper if present, or call dockerInspect equivalent) and
return a successful result when found, and only call dockerPull(imageRef, {
ignoreError: true, suppressOutput: true }) if the image is not present locally;
keep the existing early return behavior (return null) only when neither local
inspect nor pull succeeds.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3091-3101: The registerDockerDriverGatewayEndpoint function sets
process.env.OPENSHELL_GATEWAY and reports success even when runOpenshell
add/select fail; change it to only set OPENSHELL_GATEWAY and return true when
both addResult.status === 0 and selectResult.status === 0, otherwise return
false (do not export OPENSHELL_GATEWAY on failure). Also update the
reuse/early-return branch that announces reuse to check that this boolean
success is true before claiming reuse (apply same change to the analogous
implementation around the other occurrence noted).
- Around line 925-935: The current check uses getDockerCdiSpecDirs() and treats
any non-empty directory list as success; update the code to scan each directory
in cdiSpecDirs for readable NVIDIA CDI spec files (extensions .json, .yaml,
.yml), attempting to access and minimally validate each file (e.g., readable and
contains expected NVIDIA/vendor or runtime fields) and collect their full file
paths (suggest adding a helper like findReadableCdiSpecs or reuse
getDockerCdiSpecDirs to return files). If no valid NVIDIA spec files are found,
call sandboxGpuRemediationLines() and process.exit(1) as before; if found, log
the resolved spec file paths (not just directories) in the success message and
use those paths downstream (reference cdiSpecDirs, sandboxGpuRemediationLines,
and the success log).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f456db68-980b-477f-aee3-7dfc8ef95aa0

📥 Commits

Reviewing files that changed from the base of the PR and between 1e94f2e and a639877.

📒 Files selected for processing (17)
  • .github/actions/resolve-hermes-base-image/action.yaml
  • .github/actions/resolve-sandbox-base-image/action.yaml
  • .github/workflows/base-image.yaml
  • .github/workflows/docker-pin-check.yaml
  • .github/workflows/pr-self-hosted.yaml
  • Dockerfile
  • Dockerfile.base
  • agents/hermes/Dockerfile.base
  • scripts/brev-launchable-ci-cpu.sh
  • scripts/update-docker-pin.sh
  • src/lib/agent-onboard.ts
  • src/lib/onboard.ts
  • src/lib/sandbox-base-image.test.ts
  • src/lib/sandbox-base-image.ts
  • test/hermes-share-mount-deps.test.ts
  • test/onboard.test.ts
  • test/sandbox-provisioning.test.ts
✅ Files skipped from review due to trivial changes (2)
  • scripts/brev-launchable-ci-cpu.sh
  • src/lib/sandbox-base-image.test.ts

Comment thread .github/actions/resolve-sandbox-base-image/action.yaml
Comment thread src/lib/agent-onboard.ts Outdated
Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/sandbox-base-image.ts Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25346185379
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 9 passed, 5 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ❌ failure
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ⚠️ cancelled
network-policy-e2e ❌ failure
overlayfs-autofix-e2e ⚠️ cancelled
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ❌ failure
shields-config-e2e ❌ failure
skill-agent-e2e ✅ success
snapshot-commands-e2e ❌ failure
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

Failed jobs: sandbox-survival-e2e, network-policy-e2e, deployment-services-e2e, snapshot-commands-e2e, shields-config-e2e. Check run artifacts for logs.

@ericksoa ericksoa self-assigned this May 4, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
Dockerfile (1)

15-48: Run Dockerfile-focused e2e workflows before merge.

Given these image-layer/runtime dependency changes, I strongly recommend running the targeted nightly e2e jobs (cloud-e2e, sandbox-survival-e2e, hermes-e2e, rebuild-openclaw-e2e) on this branch to validate behavior under real container build/runtime conditions.

As per coding guidelines, Dockerfile: “This file affects the sandbox container image. Layer ordering, permissions, and baked config changes are only testable with a real container build.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile` around lines 15 - 48, This change updates the runtime Dockerfile
behavior (see BASE_IMAGE usage and the apt-get/apt-mark block that conditionally
installs procps/e2fsprogs, the packages variable logic, ps --version check and
command -v chattr) so run the targeted e2e container workflows before merging:
trigger cloud-e2e, sandbox-survival-e2e, hermes-e2e and rebuild-openclaw-e2e on
this branch, verify the built image boots, that ps and chattr are present as
expected, and confirm no runtime regressions from the apt removal/autoremove and
conditional install logic; if any failures occur, revert or adjust the apt
package pins/conditional branches and retest.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@Dockerfile`:
- Around line 15-48: This change updates the runtime Dockerfile behavior (see
BASE_IMAGE usage and the apt-get/apt-mark block that conditionally installs
procps/e2fsprogs, the packages variable logic, ps --version check and command -v
chattr) so run the targeted e2e container workflows before merging: trigger
cloud-e2e, sandbox-survival-e2e, hermes-e2e and rebuild-openclaw-e2e on this
branch, verify the built image boots, that ps and chattr are present as
expected, and confirm no runtime regressions from the apt removal/autoremove and
conditional install logic; if any failures occur, revert or adjust the apt
package pins/conditional branches and retest.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3edafedd-f68f-43ce-bb44-ea76f984d80f

📥 Commits

Reviewing files that changed from the base of the PR and between a639877 and dccd4c5.

📒 Files selected for processing (5)
  • Dockerfile
  • Dockerfile.base
  • agents/hermes/Dockerfile.base
  • test/hermes-share-mount-deps.test.ts
  • test/sandbox-provisioning.test.ts
🚧 Files skipped from review as they are similar to previous changes (4)
  • test/hermes-share-mount-deps.test.ts
  • Dockerfile.base
  • agents/hermes/Dockerfile.base
  • test/sandbox-provisioning.test.ts

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25347377465
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 14 passed, 6 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ❌ failure
overlayfs-autofix-e2e ❌ failure
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ❌ failure
shields-config-e2e ❌ failure
skill-agent-e2e ✅ success
snapshot-commands-e2e ❌ failure
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: sandbox-survival-e2e, sandbox-operations-e2e, network-policy-e2e, snapshot-commands-e2e, shields-config-e2e, overlayfs-autofix-e2e. Check run artifacts for logs.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25348829097
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 12 passed, 0 failed, 1 skipped

Job Result
cloud-e2e ⚠️ cancelled
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ⚠️ cancelled
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ⚠️ cancelled
snapshot-commands-e2e ✅ success
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

@ericksoa ericksoa marked this pull request as draft May 4, 2026 23:32
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25349131105
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 13 passed, 0 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ⚠️ cancelled
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ⚠️ cancelled
snapshot-commands-e2e ✅ success
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lib/snapshot-action.ts (1)

186-201: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the target sandbox driver for post-create setup, not the source driver.

srcEntry.openshellDriver describes the source sandbox, but this path creates a brand-new dstName using the current runtime. After this PR those can diverge, so keying DNS-proxy setup off the source driver can run the wrong post-create flow. It also means the cloned registry metadata below can leave the destination tagged with the wrong driver for later commands. Please resolve the actual driver of dstName after create and use that for target-side setup/registration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/snapshot-action.ts` around lines 186 - 201, The code uses
srcEntry.openshellDriver to decide post-create steps and to register metadata,
but the destination sandbox (dstName) may have a different driver; instead,
probe/resolve the actual driver for dstName after creation (e.g., call the
function that returns the sandbox's runtime/driver) and use that resolved value
for the DNS-proxy conditional (the run([...dnsScript...]) decision) and when
cloning metadata in registry.registerSandbox (set the destination's
openshellDriver/driver field to the resolved driver rather than copying
srcEntry.openshellDriver); update references to srcEntry.openshellDriver to use
the resolved target driver variable so post-create setup and registry metadata
reflect the actual dst runtime.
test/onboard.test.ts (1)

166-194: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep isOnboardTestInternals aligned with the expanded contract.

pullAndResolveBaseImageDigest is declared in OnboardTestInternals (Line 143) and destructured later, but the runtime guard does not validate it. A missing/renamed export would bypass the guard and fail later in a less obvious place.

Suggested fix
 function isOnboardTestInternals(
   value: OnboardTestInternalsCandidate,
 ): value is OnboardTestInternals {
   return (
     value !== null &&
@@
     typeof value.parsePolicyPresetEnv === "function" &&
     typeof value.patchStagedDockerfile === "function" &&
+    typeof value.pullAndResolveBaseImageDigest === "function" &&
     typeof value.writeSandboxConfigSyncFile === "function"
   );
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard.test.ts` around lines 166 - 194, The type-guard
isOnboardTestInternals is missing a check for pullAndResolveBaseImageDigest, so
add a runtime validation that typeof value.pullAndResolveBaseImageDigest ===
"function" alongside the other function checks to ensure the destructured export
exists and will not fail later when pullAndResolveBaseImageDigest is used.
🧹 Nitpick comments (1)
test/e2e/test-overlayfs-autofix.sh (1)

132-137: ⚡ Quick win

Use a stable applicability probe instead of grepping onboard.ts.

This skip now depends on the exact source text return platform === "linux";. A harmless refactor in src/lib/onboard.ts will re-enable this E2E path and start mutating /etc/docker/daemon.json again even if Linux onboarding is still Docker-driver-only. Please key this off a runtime flag/helper or another stable capability signal instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-overlayfs-autofix.sh` around lines 132 - 137, Replace the
fragile grep probe with a runtime capability check: export a stable helper
(e.g., isDockerDriverOnlyOnboarding or similar) from src/lib/onboard.ts that
returns whether onboarding is Docker-driver-only, and update the
test/e2e/test-overlayfs-autofix.sh to call that helper at runtime (for example
via node -e
'console.log(require("src/lib/onboard").isDockerDriverOnlyOnboarding())') and
base the skip on the helper's true/false result instead of grepping for the
literal 'return platform === "linux";'.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 1365-1376: The code currently moves /proc into fsPolicy.read_write
which opens the entire procfs; instead, keep /proc in read-only and only add the
minimal required proc path to fsPolicy.read_write (e.g., the exact GPU-related
comm path such as `/proc/self/task/<tid>/comm` or a single specific GPU path
computed by the onboarding logic). Concretely, update the block that populates
fsPolicy.read_write (the readWrite variable and the conditional that pushes
"/proc") to not push "/proc" and to push only the precise path string you need
(compute it where you have the tid/GPU id), leaving parsed.filesystem_policy and
fsPolicy.read_only handling unchanged. Ensure the resulting fsPolicy.read_write
contains only that single allowed proc entry and not the whole "/proc".
- Around line 9580-9585: The code currently calls destroyGateway() and
registry.clearAll() whenever isLinuxDockerDriverGatewayEnabled() is true and
gatewayReuseState !== "missing", which can silently tear down k3s sandboxes;
change the logic in the migration block (the if that references
isLinuxDockerDriverGatewayEnabled and gatewayReuseState) to prevent automatic
destruction: require an explicit user confirmation or a migration flag before
calling destroyGateway() and registry.clearAll(), and implement a safe path that
either backs up existing sandbox metadata (e.g., export or copy registry state)
or skips clearing when the user refuses; keep runOpenshell(["forward", "stop",
String(DASHBOARD_PORT)], { ignoreError: true }) for cleanup but only proceed to
destroyGateway() and registry.clearAll() after confirmation or an explicit
opt-in to destructive migration.
- Around line 4233-4245: The restart branch may kill any openshell-gateway
process that matches the binary but not our runtime; update the check so
isDockerDriverGatewayProcess is invoked with requireDockerDriverEnv: true (or an
equivalent flag) when validating existingPid before deciding to SIGTERM it or
remove the pidfile; locate the logic around
getDockerDriverGatewayPid()/isPidAlive() and replace the current
isDockerDriverGatewayProcess(existingPid, gatewayBin) call with one that ensures
the gateway was started with the Docker-driver environment, leaving
getDockerDriverGatewayPidFile(), gatewayBin and the existingPid logic intact.

In `@src/lib/shields.ts`:
- Around line 47-64: When the registry reports openshellDriver === "docker" but
no matching container is found, fail fast instead of returning null: modify
resolveDockerDriverSandboxContainer to throw a clear Error (including the
sandboxName) when the registry says the sandbox uses the docker driver yet
dockerCapture.find(...) yields no container; update any duplicate/second
implementation in this file the same way so callers (e.g., code that currently
falls back to kubectlExecArgv) will get an actionable "container not found"
error rather than silently using the wrong backend.

In `@test/e2e/test-double-onboard.sh`:
- Around line 82-124: The current logic in gateway_runtime_id() and
stop_gateway_runtime() blindly trusts any live PID from
docker_driver_gateway_pid_file and may kill an unrelated reused PID; change both
functions to verify the PID belongs to the gateway process before acting: when
you read the pid from docker_driver_gateway_pid_file in gateway_runtime_id() and
stop_gateway_runtime(), inspect the process command (e.g., /proc/<pid>/cmdline
or ps output) to ensure it contains the expected gateway/driver marker (like the
openshell gateway/driver binary name or known invocation args) and only then
return or attempt to kill it; if the PID check fails, fall back to detecting the
Docker container (docker ps ... name=openshell-cluster-nemoclaw) and avoid
issuing kill on unverified PIDs. Ensure you reference
docker_driver_gateway_pid_file, gateway_runtime_id, and stop_gateway_runtime
when making the change.

In `@test/e2e/test-sandbox-survival.sh`:
- Around line 174-186: The recovery branch in start_gateway_runtime runs
"nemoclaw \"$SANDBOX_NAME\" status" with no timeout and can hang; wrap that call
with a bounded timeout (e.g., use the timeout command or a background timer) so
the command is killed after a short limit, capture its stdout/stderr into the
existing recovery_log, and treat a timeout as a non-zero failure so the code
falls through to the health-polling path; ensure you still print the last lines
of recovery_log (sed/tail) on failure and always rm -f "$recovery_log" before
returning.
- Around line 130-172: The helper blindly trusts any live PID in
docker_driver_gateway_pid_file; update gateway_runtime_id() and
stop_gateway_runtime() to validate the PID belongs to the gateway process before
acting: after reading pid from docker_driver_gateway_pid_file, check the process
command line (e.g. /proc/$pid/cmdline or ps -p "$pid" -o args=) for an expected
marker such as "openshell gateway" or the specific gateway args used by your
runtime, and only return/kill that pid if the marker matches; if the marker does
not match, ignore the pid and fall back to checking the docker container ID as
currently done (and apply the same validation in stop_gateway_runtime() before
sending TERM/KILL).

---

Outside diff comments:
In `@src/lib/snapshot-action.ts`:
- Around line 186-201: The code uses srcEntry.openshellDriver to decide
post-create steps and to register metadata, but the destination sandbox
(dstName) may have a different driver; instead, probe/resolve the actual driver
for dstName after creation (e.g., call the function that returns the sandbox's
runtime/driver) and use that resolved value for the DNS-proxy conditional (the
run([...dnsScript...]) decision) and when cloning metadata in
registry.registerSandbox (set the destination's openshellDriver/driver field to
the resolved driver rather than copying srcEntry.openshellDriver); update
references to srcEntry.openshellDriver to use the resolved target driver
variable so post-create setup and registry metadata reflect the actual dst
runtime.

In `@test/onboard.test.ts`:
- Around line 166-194: The type-guard isOnboardTestInternals is missing a check
for pullAndResolveBaseImageDigest, so add a runtime validation that typeof
value.pullAndResolveBaseImageDigest === "function" alongside the other function
checks to ensure the destructured export exists and will not fail later when
pullAndResolveBaseImageDigest is used.

---

Nitpick comments:
In `@test/e2e/test-overlayfs-autofix.sh`:
- Around line 132-137: Replace the fragile grep probe with a runtime capability
check: export a stable helper (e.g., isDockerDriverOnlyOnboarding or similar)
from src/lib/onboard.ts that returns whether onboarding is Docker-driver-only,
and update the test/e2e/test-overlayfs-autofix.sh to call that helper at runtime
(for example via node -e
'console.log(require("src/lib/onboard").isDockerDriverOnlyOnboarding())') and
base the skip on the helper's true/false result instead of grepping for the
literal 'return platform === "linux";'.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 52c5ffcc-4502-483f-8d03-83db3a16ca42

📥 Commits

Reviewing files that changed from the base of the PR and between a639877 and ad613cf.

📒 Files selected for processing (18)
  • .github/actions/resolve-hermes-base-image/action.yaml
  • .github/actions/resolve-sandbox-base-image/action.yaml
  • Dockerfile
  • Dockerfile.base
  • agents/hermes/Dockerfile.base
  • src/lib/agent-onboard.ts
  • src/lib/onboard.ts
  • src/lib/sandbox-base-image.ts
  • src/lib/shields.ts
  • src/lib/snapshot-action.ts
  • test/e2e/test-double-onboard.sh
  • test/e2e/test-overlayfs-autofix.sh
  • test/e2e/test-sandbox-survival.sh
  • test/hermes-share-mount-deps.test.ts
  • test/install-preflight.test.ts
  • test/onboard.test.ts
  • test/runner.test.ts
  • test/sandbox-provisioning.test.ts
✅ Files skipped from review due to trivial changes (1)
  • test/hermes-share-mount-deps.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • Dockerfile.base
  • Dockerfile
  • test/sandbox-provisioning.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment on lines +1365 to +1376
parsed.filesystem_policy = parsed.filesystem_policy || {};
const fsPolicy = parsed.filesystem_policy;
fsPolicy.read_only = Array.isArray(fsPolicy.read_only)
? fsPolicy.read_only.filter((entry: unknown) => String(entry) !== "/proc")
: [];
const readWrite = Array.isArray(fsPolicy.read_write)
? fsPolicy.read_write.map((entry: unknown) => String(entry))
: [];
if (!readWrite.includes("/proc")) {
readWrite.push("/proc");
}
fsPolicy.read_write = readWrite;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep /proc read-only except for the exact GPU path you need.

This change flips the whole /proc tree from read-only to read-write for every direct-GPU sandbox. The proof here only needs /proc/self/task/<tid>/comm, so widening all of procfs materially weakens the sandbox boundary for a much larger surface than required.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 1365 - 1376, The code currently moves /proc
into fsPolicy.read_write which opens the entire procfs; instead, keep /proc in
read-only and only add the minimal required proc path to fsPolicy.read_write
(e.g., the exact GPU-related comm path such as `/proc/self/task/<tid>/comm` or a
single specific GPU path computed by the onboarding logic). Concretely, update
the block that populates fsPolicy.read_write (the readWrite variable and the
conditional that pushes "/proc") to not push "/proc" and to push only the
precise path string you need (compute it where you have the tid/GPU id), leaving
parsed.filesystem_policy and fsPolicy.read_only handling unchanged. Ensure the
resulting fsPolicy.read_write contains only that single allowed proc entry and
not the whole "/proc".

Comment thread src/lib/onboard.ts
Comment on lines +4233 to +4245
const existingPid = getDockerDriverGatewayPid() ?? portListenerPid;
if (existingPid !== null && isPidAlive(existingPid)) {
if (!isDockerDriverGatewayProcess(existingPid, gatewayBin)) {
fs.rmSync(getDockerDriverGatewayPidFile(), { force: true });
} else {
console.log(` Restarting unhealthy Docker-driver gateway process (PID ${existingPid})...`);
try {
process.kill(existingPid, "SIGTERM");
sleep(1);
} catch {
/* best effort; the new process will surface any remaining port conflict */
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only SIGTERM processes proven to be the Docker-driver gateway.

This restart branch checks isDockerDriverGatewayProcess(existingPid, gatewayBin) without requireDockerDriverEnv: true, so any other openshell-gateway process using the same binary can be killed here. On hosts that also have a packaged or legacy gateway service, onboarding can terminate the wrong process.

Suggested fix
-  if (existingPid !== null && isPidAlive(existingPid)) {
-    if (!isDockerDriverGatewayProcess(existingPid, gatewayBin)) {
+  if (existingPid !== null && isPidAlive(existingPid)) {
+    if (
+      !isDockerDriverGatewayProcess(existingPid, gatewayBin, {
+        requireDockerDriverEnv: true,
+      })
+    ) {
       fs.rmSync(getDockerDriverGatewayPidFile(), { force: true });
     } else {
       console.log(`  Restarting unhealthy Docker-driver gateway process (PID ${existingPid})...`);
       try {
         process.kill(existingPid, "SIGTERM");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 4233 - 4245, The restart branch may kill any
openshell-gateway process that matches the binary but not our runtime; update
the check so isDockerDriverGatewayProcess is invoked with
requireDockerDriverEnv: true (or an equivalent flag) when validating existingPid
before deciding to SIGTERM it or remove the pidfile; locate the logic around
getDockerDriverGatewayPid()/isPidAlive() and replace the current
isDockerDriverGatewayProcess(existingPid, gatewayBin) call with one that ensures
the gateway was started with the Docker-driver environment, leaving
getDockerDriverGatewayPidFile(), gatewayBin and the existingPid logic intact.

Comment thread src/lib/onboard.ts
Comment thread src/lib/shields/index.ts
Comment on lines +47 to +64
function resolveDockerDriverSandboxContainer(sandboxName: string): string | null {
try {
if (registry.getSandbox?.(sandboxName)?.openshellDriver !== "docker") {
return null;
}
} catch {
return null;
}
const prefix = `openshell-${sandboxName}-`;
const exact = `openshell-${sandboxName}`;
const output = dockerCapture(["ps", "--format", "{{.Names}}"], { ignoreError: true });
return (
output
.split("\n")
.map((line: string) => line.trim())
.find((name: string) => name === exact || name.startsWith(prefix)) || null
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when a Docker-driver sandbox container cannot be resolved.

If openshellDriver === "docker" but resolveDockerDriverSandboxContainer() returns null, this silently falls back to kubectlExecArgv(). There is no k3s gateway in Docker-driver mode, so every chmod/chown/stat call goes through the wrong backend instead of surfacing an actionable "container not found" error.

Suggested fix
 function privilegedSandboxExecArgv(sandboxName: string, cmd: string[]): string[] {
-  const dockerDriverContainer = resolveDockerDriverSandboxContainer(sandboxName);
-  if (dockerDriverContainer) {
-    return ["exec", "--user", "root", dockerDriverContainer, ...cmd];
+  const entry = registry.getSandbox?.(sandboxName);
+  if (entry?.openshellDriver === "docker") {
+    const dockerDriverContainer = resolveDockerDriverSandboxContainer(sandboxName);
+    if (!dockerDriverContainer) {
+      throw new Error(`Docker-driver sandbox container not found for '${sandboxName}'`);
+    }
+    return ["exec", "--user", "root", dockerDriverContainer, ...cmd];
   }
   return kubectlExecArgv(sandboxName, cmd);
 }

Also applies to: 82-87

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/shields.ts` around lines 47 - 64, When the registry reports
openshellDriver === "docker" but no matching container is found, fail fast
instead of returning null: modify resolveDockerDriverSandboxContainer to throw a
clear Error (including the sandboxName) when the registry says the sandbox uses
the docker driver yet dockerCapture.find(...) yields no container; update any
duplicate/second implementation in this file the same way so callers (e.g., code
that currently falls back to kubectlExecArgv) will get an actionable "container
not found" error rather than silently using the wrong backend.

Comment on lines +82 to +124
gateway_runtime_id() {
local pid_file pid cid
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
printf 'pid:%s\n' "$pid"
return 0
fi
fi

cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)"
if [ -n "$cid" ]; then
printf 'container:%s\n' "$cid"
return 0
fi

return 1
}

stop_gateway_runtime() {
local pid_file pid cid
openshell forward stop 18789 2>/dev/null || true
openshell gateway stop -g nemoclaw 2>/dev/null || true

pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
kill "$pid" 2>/dev/null || true
for _ in $(seq 1 10); do
kill -0 "$pid" 2>/dev/null || break
sleep 1
done
kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null || true
fi
fi

cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)"
if [ -n "$cid" ]; then
docker stop "$cid" >/dev/null 2>&1 || true
fi
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t kill arbitrary PIDs from a stale gateway pid file.

This helper assumes any live PID in openshell-gateway.pid is still the Docker-driver gateway. If that PID has been reused, gateway_runtime_id() will misidentify the runtime and stop_gateway_runtime() can terminate an unrelated process on the host.

Suggested hardening
+is_docker_driver_gateway_pid() {
+  local pid="$1"
+  [ -n "$pid" ] || return 1
+  kill -0 "$pid" 2>/dev/null || return 1
+  ps -p "$pid" -o command= 2>/dev/null | grep -Fq 'openshell-gateway'
+}
+
 gateway_runtime_id() {
   local pid_file pid cid
   pid_file="$(docker_driver_gateway_pid_file)"
   if [ -f "$pid_file" ]; then
     pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
-    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+    if is_docker_driver_gateway_pid "$pid"; then
       printf 'pid:%s\n' "$pid"
       return 0
     fi
   fi
@@
   pid_file="$(docker_driver_gateway_pid_file)"
   if [ -f "$pid_file" ]; then
     pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
-    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+    if is_docker_driver_gateway_pid "$pid"; then
       kill "$pid" 2>/dev/null || true
       for _ in $(seq 1 10); do
         kill -0 "$pid" 2>/dev/null || break
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-double-onboard.sh` around lines 82 - 124, The current logic in
gateway_runtime_id() and stop_gateway_runtime() blindly trusts any live PID from
docker_driver_gateway_pid_file and may kill an unrelated reused PID; change both
functions to verify the PID belongs to the gateway process before acting: when
you read the pid from docker_driver_gateway_pid_file in gateway_runtime_id() and
stop_gateway_runtime(), inspect the process command (e.g., /proc/<pid>/cmdline
or ps output) to ensure it contains the expected gateway/driver marker (like the
openshell gateway/driver binary name or known invocation args) and only then
return or attempt to kill it; if the PID check fails, fall back to detecting the
Docker container (docker ps ... name=openshell-cluster-nemoclaw) and avoid
issuing kill on unverified PIDs. Ensure you reference
docker_driver_gateway_pid_file, gateway_runtime_id, and stop_gateway_runtime
when making the change.

Comment on lines +130 to +172
gateway_runtime_id() {
local pid_file pid cid
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
printf 'pid:%s\n' "$pid"
return 0
fi
fi

cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)"
if [ -n "$cid" ]; then
printf 'container:%s\n' "$cid"
return 0
fi

return 1
}

stop_gateway_runtime() {
local pid_file pid cid
openshell forward stop 18789 2>/dev/null || true
openshell gateway stop -g nemoclaw 2>/dev/null || true

pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
kill "$pid" 2>/dev/null || true
for _ in $(seq 1 10); do
kill -0 "$pid" 2>/dev/null || break
sleep 1
done
kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null || true
fi
fi

cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)"
if [ -n "$cid" ]; then
docker stop "$cid" >/dev/null 2>&1 || true
fi
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the PID before treating it as the gateway.

This helper trusts any live PID found in openshell-gateway.pid. If that file is stale and the PID has been recycled, gateway_runtime_id() will report the wrong runtime and stop_gateway_runtime() can send TERM/KILL to an unrelated host process.

Suggested hardening
+is_docker_driver_gateway_pid() {
+  local pid="$1"
+  [ -n "$pid" ] || return 1
+  kill -0 "$pid" 2>/dev/null || return 1
+  ps -p "$pid" -o command= 2>/dev/null | grep -Fq 'openshell-gateway'
+}
+
 gateway_runtime_id() {
   local pid_file pid cid
   pid_file="$(docker_driver_gateway_pid_file)"
   if [ -f "$pid_file" ]; then
     pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
-    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+    if is_docker_driver_gateway_pid "$pid"; then
       printf 'pid:%s\n' "$pid"
       return 0
     fi
   fi
@@
   pid_file="$(docker_driver_gateway_pid_file)"
   if [ -f "$pid_file" ]; then
     pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
-    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+    if is_docker_driver_gateway_pid "$pid"; then
       kill "$pid" 2>/dev/null || true
       for _ in $(seq 1 10); do
         kill -0 "$pid" 2>/dev/null || break
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
gateway_runtime_id() {
local pid_file pid cid
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
printf 'pid:%s\n' "$pid"
return 0
fi
fi
cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)"
if [ -n "$cid" ]; then
printf 'container:%s\n' "$cid"
return 0
fi
return 1
}
stop_gateway_runtime() {
local pid_file pid cid
openshell forward stop 18789 2>/dev/null || true
openshell gateway stop -g nemoclaw 2>/dev/null || true
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
kill "$pid" 2>/dev/null || true
for _ in $(seq 1 10); do
kill -0 "$pid" 2>/dev/null || break
sleep 1
done
kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null || true
fi
fi
cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)"
if [ -n "$cid" ]; then
docker stop "$cid" >/dev/null 2>&1 || true
fi
}
is_docker_driver_gateway_pid() {
local pid="$1"
[ -n "$pid" ] || return 1
kill -0 "$pid" 2>/dev/null || return 1
ps -p "$pid" -o command= 2>/dev/null | grep -Fq 'openshell-gateway'
}
gateway_runtime_id() {
local pid_file pid cid
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
if is_docker_driver_gateway_pid "$pid"; then
printf 'pid:%s\n' "$pid"
return 0
fi
fi
cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)"
if [ -n "$cid" ]; then
printf 'container:%s\n' "$cid"
return 0
fi
return 1
}
stop_gateway_runtime() {
local pid_file pid cid
openshell forward stop 18789 2>/dev/null || true
openshell gateway stop -g nemoclaw 2>/dev/null || true
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
if is_docker_driver_gateway_pid "$pid"; then
kill "$pid" 2>/dev/null || true
for _ in $(seq 1 10); do
kill -0 "$pid" 2>/dev/null || break
sleep 1
done
kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null || true
fi
fi
cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)"
if [ -n "$cid" ]; then
docker stop "$cid" >/dev/null 2>&1 || true
fi
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-sandbox-survival.sh` around lines 130 - 172, The helper blindly
trusts any live PID in docker_driver_gateway_pid_file; update
gateway_runtime_id() and stop_gateway_runtime() to validate the PID belongs to
the gateway process before acting: after reading pid from
docker_driver_gateway_pid_file, check the process command line (e.g.
/proc/$pid/cmdline or ps -p "$pid" -o args=) for an expected marker such as
"openshell gateway" or the specific gateway args used by your runtime, and only
return/kill that pid if the marker matches; if the marker does not match, ignore
the pid and fall back to checking the docker container ID as currently done (and
apply the same validation in stop_gateway_runtime() before sending TERM/KILL).

Comment on lines +174 to +186
start_gateway_runtime() {
local previous_runtime="$1"
if [[ "$previous_runtime" == pid:* ]]; then
local recovery_log
recovery_log="$(mktemp)"
if nemoclaw "$SANDBOX_NAME" status >"$recovery_log" 2>&1; then
pass "Gateway recovered through NemoClaw status"
else
info "NemoClaw status recovery returned non-zero; polling gateway health"
sed 's/^/ /' "$recovery_log" | tail -40 || true
fi
rm -f "$recovery_log"
return 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Put a timeout around the recovery nemoclaw status call.

This branch can hang indefinitely if status hits the same SSH/forward recovery issue you already defend against in Phase 7. When that happens, Phase 6 never reaches the health poll and the whole E2E run burns its outer timeout instead of failing fast.

Suggested fix
 start_gateway_runtime() {
   local previous_runtime="$1"
   if [[ "$previous_runtime" == pid:* ]]; then
-    local recovery_log
+    local recovery_log recovery_exit
     recovery_log="$(mktemp)"
-    if nemoclaw "$SANDBOX_NAME" status >"$recovery_log" 2>&1; then
+    if run_with_timeout 120 nemoclaw "$SANDBOX_NAME" status >"$recovery_log" 2>&1; then
       pass "Gateway recovered through NemoClaw status"
     else
+      recovery_exit=$?
+      if [ "$recovery_exit" -eq 124 ]; then
+        fail "Gateway recovery via NemoClaw status timed out"
+        sed 's/^/    /' "$recovery_log" | tail -40 || true
+        rm -f "$recovery_log"
+        return 1
+      fi
       info "NemoClaw status recovery returned non-zero; polling gateway health"
       sed 's/^/    /' "$recovery_log" | tail -40 || true
     fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-sandbox-survival.sh` around lines 174 - 186, The recovery
branch in start_gateway_runtime runs "nemoclaw \"$SANDBOX_NAME\" status" with no
timeout and can hang; wrap that call with a bounded timeout (e.g., use the
timeout command or a background timer) so the command is killed after a short
limit, capture its stdout/stderr into the existing recovery_log, and treat a
timeout as a non-zero failure so the code falls through to the health-polling
path; ensure you still print the last lines of recovery_log (sed/tail) on
failure and always rm -f "$recovery_log" before returning.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25349684174
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 20 passed, 1 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ❌ failure
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: network-policy-e2e. Check run artifacts for logs.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25350806104
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: double-onboard-e2e,onboard-repair-e2e,network-policy-e2e
Summary: 1 passed, 0 failed, 21 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-compatible-endpoint-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ✅ success
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ⏭️ skipped
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

…r-gpu-onboard

# Conflicts:
#	test/install-preflight.test.ts
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25351590187
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: double-onboard-e2e,onboard-repair-e2e,network-policy-e2e
Summary: 1 passed, 0 failed, 21 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-compatible-endpoint-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ✅ success
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ⏭️ skipped
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25592785835
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 23 passed, 2 failed, 2 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ⚠️ cancelled
credential-sanitization-e2e ✅ success
deployment-services-e2e ⚠️ cancelled
device-auth-health-e2e ✅ success
diagnostics-e2e ❌ failure
docs-validation-e2e ✅ success
double-onboard-e2e ⚠️ cancelled
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-slack-e2e ❌ failure
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ⚠️ cancelled
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ⚠️ cancelled
onboard-repair-e2e ⚠️ cancelled
onboard-resume-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ⚠️ cancelled
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

Failed jobs: diagnostics-e2e, hermes-slack-e2e. Check run artifacts for logs.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25593021519
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 20 passed, 2 failed, 2 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ⚠️ cancelled
credential-sanitization-e2e ✅ success
deployment-services-e2e ⚠️ cancelled
device-auth-health-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ✅ success
double-onboard-e2e ⚠️ cancelled
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-slack-e2e ❌ failure
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ⚠️ cancelled
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ⚠️ cancelled
onboard-repair-e2e ⚠️ cancelled
onboard-resume-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ⚠️ cancelled
runtime-overrides-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ❌ failure
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

Failed jobs: hermes-slack-e2e, skill-agent-e2e. Check run artifacts for logs.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25593245651
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: hermes-slack-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
hermes-slack-e2e ❌ failure

Failed jobs: hermes-slack-e2e. Check run artifacts for logs.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25593514742
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: hermes-slack-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
hermes-slack-e2e ✅ success

@ericksoa ericksoa added the v0.0.39 Release target label May 9, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25593719800
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 36 passed, 0 failed, 2 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
deployment-services-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@ericksoa ericksoa requested a review from cv May 9, 2026 13:55
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25631424645
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 36 passed, 0 failed, 2 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
deployment-services-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25639847344
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 36 passed, 0 failed, 2 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
deployment-services-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25679161251
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 34 passed, 2 failed, 2 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
deployment-services-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ❌ failure
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ❌ failure
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: messaging-compatible-endpoint-e2e, runtime-overrides-e2e. Check run artifacts for logs.

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25682068629
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 36 passed, 0 failed, 2 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
deployment-services-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25683897637
Branch: codex/openshell-docker-gpu-onboard
Requested jobs: all (no filter)
Summary: 36 passed, 0 failed, 2 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
deployment-services-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@ericksoa ericksoa merged commit eb15e55 into main May 11, 2026
62 checks passed
@ericksoa ericksoa deleted the codex/openshell-docker-gpu-onboard branch May 11, 2026 17:26
jyaunches added a commit that referenced this pull request May 11, 2026
## Coverage guard for #3111 — "Docker-driver gateway is healthy"
false-positive

This PR adds an E2E regression test that fails on `main` today. It is
intentional that **this test will be red on `main` and nightly will go
red** until the fix for #3111 lands.

### The gap

Issue #3111 reports that on Ubuntu 22.04 the onboard flow prints:

```
Starting OpenShell Docker-driver gateway...
Gateway log: ~/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
✓ Docker-driver gateway is healthy    ← false positive
```

while the gateway log shows the binary never actually ran:

```
openshell-gateway: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found
openshell-gateway: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found
```

The underlying NemoClaw bug is **platform-independent**:

- `startGateway()` (`src/lib/onboard.ts:5500+`) spawns the gateway
  binary with `detached: true` and `child.unref()`. When the binary
  crashes, the detached child becomes a **zombie**, and `isPidAlive()`
  (`process.kill(pid, 0)`) returns true for zombies — so the poll loop
  doesn't break.
- `registerDockerDriverGatewayEndpoint()` (`src/lib/onboard.ts:4347`) is
**metadata-only**: `openshell gateway add --local --name nemoclaw <url>`
  writes the endpoint to the config; it does NOT probe the endpoint.
- `isGatewayHealthy()` (`src/lib/state/gateway.ts:99`) is a **string
  match on `openshell status` and `openshell gateway info` output**, not
  a live health check.

Result: on any Linux host where the gateway binary fails to start for
any reason (GLIBC mismatch, missing shared lib, permissions, OOM,
CDI-spec error, corrupted binary…), onboard reports
`✓ Docker-driver gateway is healthy` and proceeds to the next onboard
step, which then fails with a confusing `Connection refused` downstream.

There is an existing `openshell-gateway-upgrade-e2e` test covering the
**stale-gateway-replaced** path for PR #3001, but no test covers the
**gateway-binary-crashes** path that is the root issue in #3111.

### What this test does

`test/e2e/test-gateway-health-honest.sh`:

1. Installs the real `openshell` + `openshell-gateway` binaries via
   `scripts/install-openshell.sh` (same setup path as the existing
   upgrade test).
2. Drops a sabotage shim at `$STATE_DIR/openshell-gateway-sabotage` that
   exits immediately with the same GLIBC-style stderr reported in #3111.
3. Invokes `startGateway(null)` via a Node heredoc, with
   `NEMOCLAW_OPENSHELL_GATEWAY_BIN` pointing at the shim.
4. Asserts:
   - **Primary:** the onboard output does NOT contain
     `"Docker-driver gateway is healthy"`.
   - **Corroborating:** the node process exits non-zero (≠ 0).
   - **Corroborating:** onboard surfaces a user-visible failure line
     (`failed to start`, `crash`, `exit`, `not found`, or a thrown
     exception).
   - **Corroborating:** no live non-zombie gateway process remains
     after the simulated crash.

The test runs on `ubuntu-latest` and does not require an Ubuntu 22.04
runner — it exercises the NemoClaw-side bug class, not the
OpenShell-side
GLIBC packaging choice. The GLIBC compatibility concern is an OpenShell
team issue and is out of scope for this coverage guard.

### Expected CI behavior

| Ref | Nightly job `gateway-health-honest-e2e` |
|---|---|
| **this PR** | PR checks don't run nightly, so this PR's CI is green
(modulo the usual unit/lint suite). |
| **main (after merge)** | **FAILS.** Primary assertion trips: onboard
logs `✓ Docker-driver gateway is healthy`. |
| **any branch that fixes #3111** | PASSES. |

### The red-nightly tradeoff

Once merged, the nightly badge will go red on
`gateway-health-honest-e2e`
until #3111 is fixed. That is the point — the failing test is the
executable acceptance criterion for the fix. A subsequent PR authored
via
`/skill:nemoclaw-issue-kickoff 3111` will produce the fix with this test
as its definition-of-done.

### Expected failure output on main

```
[FAIL] Onboard reported '✓ Docker-driver gateway is healthy' although
the gateway binary crashed on startup (#3111 false-positive health check)
[DIAG] start log tail:
  Starting OpenShell Docker-driver gateway...
  Gateway log: /home/runner/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
  ✓ Docker-driver gateway is healthy
  __onboard_startGateway_returned_successfully__
[DIAG] gateway log tail:
  openshell-gateway-sabotage: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found
  openshell-gateway-sabotage: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found
```

### Wiring

- New file: `test/e2e/test-gateway-health-honest.sh` (~170 LOC, modeled
  after `test/e2e/test-openshell-gateway-upgrade.sh`)
- New job: `gateway-health-honest-e2e` in
`.github/workflows/nightly-e2e.yaml`
(6 edits: comment, inputs.jobs description, new job block, 3 needs
arrays)

### References

- Issue #3111 (NV QA / UAT / Platform: Brev / Platform: Ubuntu / Docker
/ bug)
- NVB#6150133
- PR #3001 (merged; introduced the affected code path)
- Existing test being modeled:
`test/e2e/test-openshell-gateway-upgrade.sh`
- Relevant source:
  - `src/lib/onboard.ts` — `startGateway`, `startGatewayWithOptions`,
    `isPidAlive`, `registerDockerDriverGatewayEndpoint`
  - `src/lib/state/gateway.ts` — `isGatewayHealthy`


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added an end-to-end test that verifies gateway start failure is
detected, exposes user-visible error output, and ensures no lingering
gateway processes remain.
* **Chores**
* Integrated the new test into the nightly pipeline as a selectable job,
including artifact upload on failure and inclusion in failure reporting,
PR comments, and the nightly scorecard.
* **Documentation**
* Updated nightly workflow job selection documentation and added a
review-tool entry linking the test to the nightly job.

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3362)

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3362)
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@miyoungc miyoungc mentioned this pull request May 12, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 12, 2026
## Summary
Refreshes the release-prep docs for v0.0.39 based on changes merged
since the Friday 4pm doc refresh. Updates the source docs, bumps the
docs version metadata, and regenerates the NemoClaw user skills from the
refreshed docs.

## Changes
- #3314 -> `docs/get-started/prerequisites.md`,
`docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`:
Documents installer Docker setup, Docker group activation, and retry
guidance.
- #3317 -> `docs/get-started/quickstart.md`,
`docs/reference/commands.md`: Documents the DGX Spark and DGX Station
express install prompt and `NEMOCLAW_NO_EXPRESS`.
- #3328 and #3329 -> `docs/security/best-practices.md`,
`docs/deployment/sandbox-hardening.md`: Updates sandbox capability
hardening docs for the stricter bounding-set and `setpriv` step-down
behavior.
- #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`:
Documents Windows-host Ollama relaunch behavior, NIM key passthrough,
early health-fail diagnostics, and mixed-GPU preflight detail.
- #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 ->
`docs/about/release-notes.md`: Adds the v0.0.39 release-prep section
while keeping the v0.0.38 release notes intact.
- Advances the release-prep docs metadata from v0.0.38 to v0.0.39.
- Regenerates `.agents/skills/nemoclaw-user-*` from the updated source
docs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes v0.0.39

* **New Features**
  * Host alias management commands for easier configuration
  * Sandbox GPU control options during onboarding
  * Update command with check and confirmation modes

* **Documentation**
* Enhanced Linux installer guidance with Docker and group membership
handling
  * Expanded troubleshooting for permission and connectivity issues
  * Improved capability-dropping security documentation
  * Updated inference model switching commands
  * Brev environment-specific troubleshooting

* **Improvements**
  * DGX Spark/Station express install flow
  * Windows Ollama relay and health-check enhancements
  * NVIDIA NIM preflight GPU reporting

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3375)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Docker Support for Docker containerization enhancement: platform Request for support on other platforms. NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). OpenShell Support for OpenShell, a safe, private runtime for autonomous AI agents v0.0.39 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants