feat(onboard): use OpenShell Docker GPU sandboxes#3001
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds sandbox GPU support and OpenShell channel-aware installation; integrates a Linux Docker-driver gateway flow; persists and surfaces GPU/OpenShell metadata in registry, status, and inventory; implements a GLIBC-aware sandbox base-image resolver for CI; converts OpenShell installer to multi-asset release downloads; and updates tests including e2e GPU verification. ChangesSandbox GPU, Docker-Driver Gateway, and OpenShell Channel
Sequence DiagramsequenceDiagram
participant User
participant CLI as Onboard CLI
participant Core as Onboard Core
participant GPU as GPU Resolver
participant Openshell as OpenShell (gateway/sandbox)
participant Registry
User->>CLI: nemoclaw onboard --sandbox-gpu --sandbox-gpu-device 0
CLI->>CLI: parse flags (--sandbox-gpu, --sandbox-gpu-device)
CLI->>Core: runOnboard({ sandboxGpu:"enable", sandboxGpuDevice:"0", ... })
Core->>GPU: resolveSandboxGpuConfig(detectedGpu, { flag:"enable", device:"0" })
GPU-->>Core: SandboxGpuConfig { sandboxGpuEnabled:true, sandboxGpuMode:..., sandboxGpuDevice:"0" }
Core->>Openshell: prepare direct-GPU policy & create sandbox (--gpu --gpu-device 0)
Openshell->>Openshell: create sandbox, mount GPU device
Core->>Openshell: verifyDirectSandboxGpu(sandbox)
Openshell-->>Core: proof result (nvidia-smi, /proc write, cuInit)
Core->>Registry: registerSandbox(..., gpu fields, openshell version/driver)
Core-->>User: onboarding complete (sandbox with GPU)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/onboard.test.ts (1)
151-169:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStrengthen
isOnboardTestInternalsfor newly required helpers.Line 151 validates only the legacy helper subset, but newly required helpers (for example those added at Lines 186-205 and used at Lines 243-325) are now assumed callable. If one export is missing, the guard still passes and failures become late
... is not a functionerrors.🔧 Suggested patch
function isOnboardTestInternals( value: OnboardTestInternalsCandidate, ): value is OnboardTestInternals { return ( value !== null && @@ typeof value.buildSandboxConfigSyncScript === "function" && + typeof value.buildSandboxGpuCreateArgs === "function" && + typeof value.buildDirectGpuPolicyYaml === "function" && + typeof value.buildDirectSandboxGpuProofCommands === "function" && @@ typeof value.getBlueprintMinOpenshellVersion === "function" && typeof value.getBlueprintMaxOpenshellVersion === "function" && + typeof value.getDockerDriverGatewayEnv === "function" && + typeof value.isLinuxDockerDriverGatewayEnabled === "function" && + typeof value.parseDockerCdiSpecDirs === "function" && + typeof value.resolveSandboxGpuConfig === "function" && + typeof value.shouldAllowOpenshellAboveBlueprintMax === "function" && @@ typeof value.shouldRunCompatibleEndpointSandboxSmoke === "function" && typeof value.writeSandboxConfigSyncFile === "function" ); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/onboard.test.ts` around lines 151 - 169, The type-guard isOnboardTestInternals only checks the legacy helper subset and must be extended to include typeof checks for every newly required helper export used later in the tests; update the isOnboardTestInternals function to assert that each new helper (the additional functions referenced by the tests beyond the existing list) exists and is a function so the guard fails early rather than producing late "is not a function" errors. Locate isOnboardTestInternals and add typeof checks for each of the new helper symbols introduced elsewhere in the test suite so the returned boolean accurately reflects the full required API surface.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4068-4077: The restart path currently kills any live PID returned
by getDockerDriverGatewayPid() which can accidentally terminate unrelated
processes if the PID was recycled; modify the logic so after confirming
isPidAlive(existingPid) you validate that the live process belongs to the
gateway (e.g., inspect the process command line/executable via
/proc/<pid>/cmdline or a ps call and ensure it contains "openshell-gateway" or
the known gateway binary path), and only then call process.kill(existingPid,
"SIGTERM"); if the check fails, remove/clear the stale pid file (the same PID
file read by getDockerDriverGatewayPid()) and skip the kill attempt so unrelated
processes are not affected.
- Around line 1393-1408: verifyDirectSandboxGpu currently calls
process.exit(...) on failure which kills the process and leaves an orphaned
OpenShell sandbox; instead make failures propagate to the caller (createSandbox)
so it can clean up. Change verifyDirectSandboxGpu to stop calling process.exit
and instead throw an Error (or return a rejected Promise/false if you prefer
async patterns) containing the proof label, status and diagnostic info (use
result.status, proof.label and the compactText(redact(...)) diagnostic), so
createSandbox() can catch it, delete the sandbox, and then fail cleanly; keep
the existing logging (console.error lines and remediation lines) but remove
process.exit and replace it with the thrown/rejected failure.
In `@src/lib/sandbox-status-action.ts`:
- Around line 72-79: The status rendering treats sb.sandboxGpuEnabled as
absent/false and prints "disabled" for older entries that only set
sb.gpuEnabled; update the logic that computes sandboxGpu (currently using
sb.sandboxGpuEnabled) to fallback to sb.gpuEnabled when sandboxGpuEnabled is
undefined (i.e., use sandboxGpuEnabled ?? sb.gpuEnabled or explicit undefined
check) so legacy entries with gpuEnabled: true render as enabled; keep the rest
of the sandboxGpuMode/sandboxGpuDevice logic unchanged (refer to
sb.sandboxGpuEnabled, sb.gpuEnabled, and the sandboxGpu variable).
---
Outside diff comments:
In `@test/onboard.test.ts`:
- Around line 151-169: The type-guard isOnboardTestInternals only checks the
legacy helper subset and must be extended to include typeof checks for every
newly required helper export used later in the tests; update the
isOnboardTestInternals function to assert that each new helper (the additional
functions referenced by the tests beyond the existing list) exists and is a
function so the guard fails early rather than producing late "is not a function"
errors. Locate isOnboardTestInternals and add typeof checks for each of the new
helper symbols introduced elsewhere in the test suite so the returned boolean
accurately reflects the full required API surface.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d605cafe-ac24-43a5-aa80-fbfda7e5c795
📒 Files selected for processing (11)
scripts/install-openshell.shsrc/lib/inventory-commands.test.tssrc/lib/inventory-commands.tssrc/lib/onboard-command.test.tssrc/lib/onboard-command.tssrc/lib/onboard.tssrc/lib/registry.tssrc/lib/sandbox-status-action.tstest/e2e/test-gpu-e2e.shtest/install-openshell-version-check.test.tstest/onboard.test.ts
Selective E2E Results — ❌ Some jobs failedRun: 25340341019
|
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
3067-3070:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate the PID file still belongs to
openshell-gateway.
isDockerDriverGatewayProcessAlive()only checks whether the recorded PID exists. Ifopenshell-gateway.pidis stale and that PID has been recycled, the Linux reuse path can accept an unrelated process as “alive” and skip gateway recreation. Reuse should require both liveness andisDockerDriverGatewayProcess(...), and clear the pid file when identity no longer matches.Suggested direction
function isDockerDriverGatewayProcessAlive(): boolean { const pid = getDockerDriverGatewayPid(); - return pid !== null && isPidAlive(pid); + if (pid === null || !isPidAlive(pid)) return false; + const gatewayBin = resolveOpenShellGatewayBinary(); + if (!isDockerDriverGatewayProcess(pid, gatewayBin)) { + fs.rmSync(getDockerDriverGatewayPidFile(), { force: true }); + return false; + } + return true; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 3067 - 3070, Update isDockerDriverGatewayProcessAlive to not only check PID liveness but also verify process identity using isDockerDriverGatewayProcess(pid); call getDockerDriverGatewayPid(), if pid is null return false, then if isPidAlive(pid) is true but isDockerDriverGatewayProcess(pid) is false, delete/clear the openshell-gateway PID file (use the same mechanism that writes the PID) and return false; only return true when both isPidAlive(pid) and isDockerDriverGatewayProcess(pid) are true. Reference functions: isDockerDriverGatewayProcessAlive, getDockerDriverGatewayPid, isPidAlive, isDockerDriverGatewayProcess.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3104-3115: The reuse path must only claim gateway reuse if the
endpoint was actually registered: update registerDockerDriverGatewayEndpoint()
so it sets process.env.OPENSHELL_GATEWAY only when both runOpenshell calls
succeeded (addResult.status === 0 && selectResult.status === 0) and return that
boolean; then change the reuse/early-return logic (the places around the
liveness/health check that currently return without checking
registerDockerDriverGatewayEndpoint) to require
registerDockerDriverGatewayEndpoint() to return true before announcing reuse or
exiting early. Use the existing symbols runOpenshell, addResult, selectResult,
GATEWAY_NAME and OPENSHELL_GATEWAY to locate and change the code.
- Around line 919-928: The current check only verifies that CDI directories
exist; instead, after calling getDockerCdiSpecDirs() iterate those directories
and search for at least one readable CDI spec file (match extensions .json or
.yaml/.yml) using fs.readdir/fs.stat (or their sync variants) and handle
permission errors; if no spec files are found, print the same remediation
messages (using sandboxGpuRemediationLines()) and exit(1); if spec files are
found, log the detected spec file paths in the success message (replace
cdiSpecDirs.join(...) with the actual spec file paths) so that the check ensures
Docker has real NVIDIA CDI spec files before declaring GPU support.
In `@test/runner.test.ts`:
- Around line 672-677: The stub that handles
openshell-gateway-checksums-sha256.txt and
openshell-sandbox-checksums-sha256.txt only emits x86_64 asset lines, so add
corresponding aarch64 entries to the printf output for both cases (i.e., emit
lines for 'openshell-gateway-aarch64-unknown-linux-gnu.tar.gz' and
'openshell-sandbox-aarch64-unknown-linux-gnu.tar.gz' in the same format as the
existing 'ignored openshell-*-x86_64-unknown-linux-gnu.tar.gz' lines) so the
grep-based checksum matching in the installer will find aarch64 entries when
running on Linux aarch64.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3067-3070: Update isDockerDriverGatewayProcessAlive to not only
check PID liveness but also verify process identity using
isDockerDriverGatewayProcess(pid); call getDockerDriverGatewayPid(), if pid is
null return false, then if isPidAlive(pid) is true but
isDockerDriverGatewayProcess(pid) is false, delete/clear the openshell-gateway
PID file (use the same mechanism that writes the PID) and return false; only
return true when both isPidAlive(pid) and isDockerDriverGatewayProcess(pid) are
true. Reference functions: isDockerDriverGatewayProcessAlive,
getDockerDriverGatewayPid, isPidAlive, isDockerDriverGatewayProcess.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c00cc609-0764-4942-9057-fbbbe157098f
📒 Files selected for processing (5)
src/lib/onboard.tssrc/lib/sandbox-status-action.tstest/cli.test.tstest/onboard.test.tstest/runner.test.ts
✅ Files skipped from review due to trivial changes (1)
- src/lib/sandbox-status-action.ts
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
src/lib/onboard.ts (2)
3091-3101:⚠️ Potential issue | 🟠 Major | ⚡ Quick winOnly mark the Docker-driver gateway reusable after add/select succeeds.
This helper still exports
OPENSHELL_GATEWAYeven whengateway add/selectfailed, and the reuse branch still returns without checking the boolean. That can make onboarding announce reuse while later commands target an unregistered or wrong gateway.Suggested fix
function registerDockerDriverGatewayEndpoint(): boolean { const addResult = runOpenshell( ["gateway", "add", "--local", "--name", GATEWAY_NAME, getDockerDriverGatewayEndpoint()], { ignoreError: true, suppressOutput: true }, ); const selectResult = runOpenshell(["gateway", "select", GATEWAY_NAME], { ignoreError: true, suppressOutput: true, }); - process.env.OPENSHELL_GATEWAY = GATEWAY_NAME; - return addResult.status === 0 && selectResult.status === 0; + const ok = addResult.status === 0 && selectResult.status === 0; + if (ok) process.env.OPENSHELL_GATEWAY = GATEWAY_NAME; + return ok; }if ( isDockerDriverGatewayProcessAlive() && isGatewayHealthy(gatewayStatus, gwInfo, activeGatewayInfo) ) { - console.log(" ✓ Reusing existing Docker-driver gateway"); - registerDockerDriverGatewayEndpoint(); - return; + if (registerDockerDriverGatewayEndpoint()) { + console.log(" ✓ Reusing existing Docker-driver gateway"); + return; + } }Also applies to: 4060-4066
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 3091 - 3101, The registerDockerDriverGatewayEndpoint function sets process.env.OPENSHELL_GATEWAY and reports success even when runOpenshell add/select fail; change it to only set OPENSHELL_GATEWAY and return true when both addResult.status === 0 and selectResult.status === 0, otherwise return false (do not export OPENSHELL_GATEWAY on failure). Also update the reuse/early-return branch that announces reuse to check that this boolean success is true before claiming reuse (apply same change to the analogous implementation around the other occurrence noted).
925-935:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRequire a real NVIDIA CDI spec before passing GPU preflight.
Checking only
.CDISpecDirsstill lets empty CDI directories pass here, so onboarding can report GPU support and then fail later whensandbox create --gpuhas no NVIDIA spec to use. Validate that at least one readable NVIDIA.json/.yamlspec exists in those directories and log the resolved spec file paths instead of the directory list.Suggested fix
const cdiSpecDirs = getDockerCdiSpecDirs(); - if (cdiSpecDirs.length === 0) { + const cdiSpecFiles = cdiSpecDirs.flatMap((dir) => { + try { + return fs + .readdirSync(dir) + .filter((name) => /^nvidia.*\.(json|ya?ml)$/i.test(name)) + .map((name) => path.join(dir, name)); + } catch { + return []; + } + }); + if (cdiSpecFiles.length === 0) { console.error(""); console.error(" ✗ Docker CDI GPU support was not detected."); for (const line of sandboxGpuRemediationLines()) { console.error(` ${line}`); } process.exit(1); } - console.log(` ✓ Docker CDI GPU support detected (${cdiSpecDirs.join(", ")})`); + console.log(` ✓ Docker CDI GPU support detected (${cdiSpecFiles.join(", ")})`);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 925 - 935, The current check uses getDockerCdiSpecDirs() and treats any non-empty directory list as success; update the code to scan each directory in cdiSpecDirs for readable NVIDIA CDI spec files (extensions .json, .yaml, .yml), attempting to access and minimally validate each file (e.g., readable and contains expected NVIDIA/vendor or runtime fields) and collect their full file paths (suggest adding a helper like findReadableCdiSpecs or reuse getDockerCdiSpecDirs to return files). If no valid NVIDIA spec files are found, call sandboxGpuRemediationLines() and process.exit(1) as before; if found, log the resolved spec file paths (not just directories) in the success message and use those paths downstream (reference cdiSpecDirs, sandboxGpuRemediationLines, and the success log).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/actions/resolve-sandbox-base-image/action.yaml:
- Around line 55-57: After building the local fallback image with "docker build
-f Dockerfile.base -t nemoclaw-sandbox-base-local .", run the same glibc_version
ABI check inside that newly built image (e.g., docker run --rm
nemoclaw-sandbox-base-local sh -c '<glibc_version check command>') and only
export BASE_IMAGE=nemoclaw-sandbox-base-local to GITHUB_ENV if the check
succeeds; if the check fails, emit a failure/warning and exit non-zero so
downstream steps don't consume an incompatible base image. Ensure you reference
the built image tag "nemoclaw-sandbox-base-local", the Dockerfile
"Dockerfile.base" and the "BASE_IMAGE" env var in your fix.
In `@src/lib/agent-onboard.ts`:
- Around line 82-99: The code currently treats a null return from
resolveSandboxBaseImage() as if a compatible image exists by inspecting/building
`${baseImageName}:latest` and assigning baseImageRef = baseImageTag; instead,
update the branch handling when resolved is falsy so that
resolveSandboxBaseImage() failure is not silently bypassed: check the platform
(e.g., process.platform) and if on Linux, throw or return an explicit error
(including agent.displayName and baseImageName) instead of inspecting/building
:latest; for non-Linux platforms you may keep the existing
dockerImageInspect/dockerBuild flow but only after documenting/validating that
using baseImageTag is safe. Ensure you modify the code paths that set
baseImageRef, call dockerImageInspect, and call dockerBuild so a resolver
failure does not automatically set baseImageRef = baseImageTag on Linux.
In `@src/lib/onboard.ts`:
- Around line 1382-1383: The GPU-proof loop can hang; add a per-proof timeout
(e.g. 10_000 ms) around each runOpenshell invocation so a wedged probe will be
aborted and treated as failure; implement a small helper (e.g.
runOpenshellWithTimeout) that calls
buildDirectSandboxGpuProofCommands(sandboxName) entries and for each proof
starts runOpenshell(proof.args, { ignoreError: true, suppressOutput: true }),
sets a timer to kill/abort the underlying process (or call whatever runOpenshell
exposes to terminate) if it exceeds the timeout, clears the timer on completion,
and returns a timeout error result so the existing delete-and-fail path can
proceed. Ensure the timeout is per-proof and that the helper preserves the
original flags and error semantics.
In `@src/lib/sandbox-base-image.ts`:
- Around line 139-140: resolvePulledCandidate currently always attempts
dockerPull and treats non-pullable refs as failures even if the image exists
locally; change the logic in resolvePulledCandidate to first check the local
Docker cache for imageRef (use the existing docker inspect/inspectImage or
dockerImageExists helper if present, or call dockerInspect equivalent) and
return a successful result when found, and only call dockerPull(imageRef, {
ignoreError: true, suppressOutput: true }) if the image is not present locally;
keep the existing early return behavior (return null) only when neither local
inspect nor pull succeeds.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3091-3101: The registerDockerDriverGatewayEndpoint function sets
process.env.OPENSHELL_GATEWAY and reports success even when runOpenshell
add/select fail; change it to only set OPENSHELL_GATEWAY and return true when
both addResult.status === 0 and selectResult.status === 0, otherwise return
false (do not export OPENSHELL_GATEWAY on failure). Also update the
reuse/early-return branch that announces reuse to check that this boolean
success is true before claiming reuse (apply same change to the analogous
implementation around the other occurrence noted).
- Around line 925-935: The current check uses getDockerCdiSpecDirs() and treats
any non-empty directory list as success; update the code to scan each directory
in cdiSpecDirs for readable NVIDIA CDI spec files (extensions .json, .yaml,
.yml), attempting to access and minimally validate each file (e.g., readable and
contains expected NVIDIA/vendor or runtime fields) and collect their full file
paths (suggest adding a helper like findReadableCdiSpecs or reuse
getDockerCdiSpecDirs to return files). If no valid NVIDIA spec files are found,
call sandboxGpuRemediationLines() and process.exit(1) as before; if found, log
the resolved spec file paths (not just directories) in the success message and
use those paths downstream (reference cdiSpecDirs, sandboxGpuRemediationLines,
and the success log).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f456db68-980b-477f-aee3-7dfc8ef95aa0
📒 Files selected for processing (17)
.github/actions/resolve-hermes-base-image/action.yaml.github/actions/resolve-sandbox-base-image/action.yaml.github/workflows/base-image.yaml.github/workflows/docker-pin-check.yaml.github/workflows/pr-self-hosted.yamlDockerfileDockerfile.baseagents/hermes/Dockerfile.basescripts/brev-launchable-ci-cpu.shscripts/update-docker-pin.shsrc/lib/agent-onboard.tssrc/lib/onboard.tssrc/lib/sandbox-base-image.test.tssrc/lib/sandbox-base-image.tstest/hermes-share-mount-deps.test.tstest/onboard.test.tstest/sandbox-provisioning.test.ts
✅ Files skipped from review due to trivial changes (2)
- scripts/brev-launchable-ci-cpu.sh
- src/lib/sandbox-base-image.test.ts
Selective E2E Results — ❌ Some jobs failedRun: 25346185379
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
Dockerfile (1)
15-48: Run Dockerfile-focused e2e workflows before merge.Given these image-layer/runtime dependency changes, I strongly recommend running the targeted nightly e2e jobs (
cloud-e2e,sandbox-survival-e2e,hermes-e2e,rebuild-openclaw-e2e) on this branch to validate behavior under real container build/runtime conditions.As per coding guidelines,
Dockerfile: “This file affects the sandbox container image. Layer ordering, permissions, and baked config changes are only testable with a real container build.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Dockerfile` around lines 15 - 48, This change updates the runtime Dockerfile behavior (see BASE_IMAGE usage and the apt-get/apt-mark block that conditionally installs procps/e2fsprogs, the packages variable logic, ps --version check and command -v chattr) so run the targeted e2e container workflows before merging: trigger cloud-e2e, sandbox-survival-e2e, hermes-e2e and rebuild-openclaw-e2e on this branch, verify the built image boots, that ps and chattr are present as expected, and confirm no runtime regressions from the apt removal/autoremove and conditional install logic; if any failures occur, revert or adjust the apt package pins/conditional branches and retest.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@Dockerfile`:
- Around line 15-48: This change updates the runtime Dockerfile behavior (see
BASE_IMAGE usage and the apt-get/apt-mark block that conditionally installs
procps/e2fsprogs, the packages variable logic, ps --version check and command -v
chattr) so run the targeted e2e container workflows before merging: trigger
cloud-e2e, sandbox-survival-e2e, hermes-e2e and rebuild-openclaw-e2e on this
branch, verify the built image boots, that ps and chattr are present as
expected, and confirm no runtime regressions from the apt removal/autoremove and
conditional install logic; if any failures occur, revert or adjust the apt
package pins/conditional branches and retest.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3edafedd-f68f-43ce-bb44-ea76f984d80f
📒 Files selected for processing (5)
DockerfileDockerfile.baseagents/hermes/Dockerfile.basetest/hermes-share-mount-deps.test.tstest/sandbox-provisioning.test.ts
🚧 Files skipped from review as they are similar to previous changes (4)
- test/hermes-share-mount-deps.test.ts
- Dockerfile.base
- agents/hermes/Dockerfile.base
- test/sandbox-provisioning.test.ts
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 25347377465
|
Selective E2E Results — ✅ All requested jobs passedRun: 25348829097
|
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 25349131105
|
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lib/snapshot-action.ts (1)
186-201:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the target sandbox driver for post-create setup, not the source driver.
srcEntry.openshellDriverdescribes the source sandbox, but this path creates a brand-newdstNameusing the current runtime. After this PR those can diverge, so keying DNS-proxy setup off the source driver can run the wrong post-create flow. It also means the cloned registry metadata below can leave the destination tagged with the wrong driver for later commands. Please resolve the actual driver ofdstNameafter create and use that for target-side setup/registration.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/snapshot-action.ts` around lines 186 - 201, The code uses srcEntry.openshellDriver to decide post-create steps and to register metadata, but the destination sandbox (dstName) may have a different driver; instead, probe/resolve the actual driver for dstName after creation (e.g., call the function that returns the sandbox's runtime/driver) and use that resolved value for the DNS-proxy conditional (the run([...dnsScript...]) decision) and when cloning metadata in registry.registerSandbox (set the destination's openshellDriver/driver field to the resolved driver rather than copying srcEntry.openshellDriver); update references to srcEntry.openshellDriver to use the resolved target driver variable so post-create setup and registry metadata reflect the actual dst runtime.test/onboard.test.ts (1)
166-194:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winKeep
isOnboardTestInternalsaligned with the expanded contract.
pullAndResolveBaseImageDigestis declared inOnboardTestInternals(Line 143) and destructured later, but the runtime guard does not validate it. A missing/renamed export would bypass the guard and fail later in a less obvious place.Suggested fix
function isOnboardTestInternals( value: OnboardTestInternalsCandidate, ): value is OnboardTestInternals { return ( value !== null && @@ typeof value.parsePolicyPresetEnv === "function" && typeof value.patchStagedDockerfile === "function" && + typeof value.pullAndResolveBaseImageDigest === "function" && typeof value.writeSandboxConfigSyncFile === "function" ); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/onboard.test.ts` around lines 166 - 194, The type-guard isOnboardTestInternals is missing a check for pullAndResolveBaseImageDigest, so add a runtime validation that typeof value.pullAndResolveBaseImageDigest === "function" alongside the other function checks to ensure the destructured export exists and will not fail later when pullAndResolveBaseImageDigest is used.
🧹 Nitpick comments (1)
test/e2e/test-overlayfs-autofix.sh (1)
132-137: ⚡ Quick winUse a stable applicability probe instead of grepping
onboard.ts.This skip now depends on the exact source text
return platform === "linux";. A harmless refactor insrc/lib/onboard.tswill re-enable this E2E path and start mutating/etc/docker/daemon.jsonagain even if Linux onboarding is still Docker-driver-only. Please key this off a runtime flag/helper or another stable capability signal instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-overlayfs-autofix.sh` around lines 132 - 137, Replace the fragile grep probe with a runtime capability check: export a stable helper (e.g., isDockerDriverOnlyOnboarding or similar) from src/lib/onboard.ts that returns whether onboarding is Docker-driver-only, and update the test/e2e/test-overlayfs-autofix.sh to call that helper at runtime (for example via node -e 'console.log(require("src/lib/onboard").isDockerDriverOnlyOnboarding())') and base the skip on the helper's true/false result instead of grepping for the literal 'return platform === "linux";'.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 1365-1376: The code currently moves /proc into fsPolicy.read_write
which opens the entire procfs; instead, keep /proc in read-only and only add the
minimal required proc path to fsPolicy.read_write (e.g., the exact GPU-related
comm path such as `/proc/self/task/<tid>/comm` or a single specific GPU path
computed by the onboarding logic). Concretely, update the block that populates
fsPolicy.read_write (the readWrite variable and the conditional that pushes
"/proc") to not push "/proc" and to push only the precise path string you need
(compute it where you have the tid/GPU id), leaving parsed.filesystem_policy and
fsPolicy.read_only handling unchanged. Ensure the resulting fsPolicy.read_write
contains only that single allowed proc entry and not the whole "/proc".
- Around line 9580-9585: The code currently calls destroyGateway() and
registry.clearAll() whenever isLinuxDockerDriverGatewayEnabled() is true and
gatewayReuseState !== "missing", which can silently tear down k3s sandboxes;
change the logic in the migration block (the if that references
isLinuxDockerDriverGatewayEnabled and gatewayReuseState) to prevent automatic
destruction: require an explicit user confirmation or a migration flag before
calling destroyGateway() and registry.clearAll(), and implement a safe path that
either backs up existing sandbox metadata (e.g., export or copy registry state)
or skips clearing when the user refuses; keep runOpenshell(["forward", "stop",
String(DASHBOARD_PORT)], { ignoreError: true }) for cleanup but only proceed to
destroyGateway() and registry.clearAll() after confirmation or an explicit
opt-in to destructive migration.
- Around line 4233-4245: The restart branch may kill any openshell-gateway
process that matches the binary but not our runtime; update the check so
isDockerDriverGatewayProcess is invoked with requireDockerDriverEnv: true (or an
equivalent flag) when validating existingPid before deciding to SIGTERM it or
remove the pidfile; locate the logic around
getDockerDriverGatewayPid()/isPidAlive() and replace the current
isDockerDriverGatewayProcess(existingPid, gatewayBin) call with one that ensures
the gateway was started with the Docker-driver environment, leaving
getDockerDriverGatewayPidFile(), gatewayBin and the existingPid logic intact.
In `@src/lib/shields.ts`:
- Around line 47-64: When the registry reports openshellDriver === "docker" but
no matching container is found, fail fast instead of returning null: modify
resolveDockerDriverSandboxContainer to throw a clear Error (including the
sandboxName) when the registry says the sandbox uses the docker driver yet
dockerCapture.find(...) yields no container; update any duplicate/second
implementation in this file the same way so callers (e.g., code that currently
falls back to kubectlExecArgv) will get an actionable "container not found"
error rather than silently using the wrong backend.
In `@test/e2e/test-double-onboard.sh`:
- Around line 82-124: The current logic in gateway_runtime_id() and
stop_gateway_runtime() blindly trusts any live PID from
docker_driver_gateway_pid_file and may kill an unrelated reused PID; change both
functions to verify the PID belongs to the gateway process before acting: when
you read the pid from docker_driver_gateway_pid_file in gateway_runtime_id() and
stop_gateway_runtime(), inspect the process command (e.g., /proc/<pid>/cmdline
or ps output) to ensure it contains the expected gateway/driver marker (like the
openshell gateway/driver binary name or known invocation args) and only then
return or attempt to kill it; if the PID check fails, fall back to detecting the
Docker container (docker ps ... name=openshell-cluster-nemoclaw) and avoid
issuing kill on unverified PIDs. Ensure you reference
docker_driver_gateway_pid_file, gateway_runtime_id, and stop_gateway_runtime
when making the change.
In `@test/e2e/test-sandbox-survival.sh`:
- Around line 174-186: The recovery branch in start_gateway_runtime runs
"nemoclaw \"$SANDBOX_NAME\" status" with no timeout and can hang; wrap that call
with a bounded timeout (e.g., use the timeout command or a background timer) so
the command is killed after a short limit, capture its stdout/stderr into the
existing recovery_log, and treat a timeout as a non-zero failure so the code
falls through to the health-polling path; ensure you still print the last lines
of recovery_log (sed/tail) on failure and always rm -f "$recovery_log" before
returning.
- Around line 130-172: The helper blindly trusts any live PID in
docker_driver_gateway_pid_file; update gateway_runtime_id() and
stop_gateway_runtime() to validate the PID belongs to the gateway process before
acting: after reading pid from docker_driver_gateway_pid_file, check the process
command line (e.g. /proc/$pid/cmdline or ps -p "$pid" -o args=) for an expected
marker such as "openshell gateway" or the specific gateway args used by your
runtime, and only return/kill that pid if the marker matches; if the marker does
not match, ignore the pid and fall back to checking the docker container ID as
currently done (and apply the same validation in stop_gateway_runtime() before
sending TERM/KILL).
---
Outside diff comments:
In `@src/lib/snapshot-action.ts`:
- Around line 186-201: The code uses srcEntry.openshellDriver to decide
post-create steps and to register metadata, but the destination sandbox
(dstName) may have a different driver; instead, probe/resolve the actual driver
for dstName after creation (e.g., call the function that returns the sandbox's
runtime/driver) and use that resolved value for the DNS-proxy conditional (the
run([...dnsScript...]) decision) and when cloning metadata in
registry.registerSandbox (set the destination's openshellDriver/driver field to
the resolved driver rather than copying srcEntry.openshellDriver); update
references to srcEntry.openshellDriver to use the resolved target driver
variable so post-create setup and registry metadata reflect the actual dst
runtime.
In `@test/onboard.test.ts`:
- Around line 166-194: The type-guard isOnboardTestInternals is missing a check
for pullAndResolveBaseImageDigest, so add a runtime validation that typeof
value.pullAndResolveBaseImageDigest === "function" alongside the other function
checks to ensure the destructured export exists and will not fail later when
pullAndResolveBaseImageDigest is used.
---
Nitpick comments:
In `@test/e2e/test-overlayfs-autofix.sh`:
- Around line 132-137: Replace the fragile grep probe with a runtime capability
check: export a stable helper (e.g., isDockerDriverOnlyOnboarding or similar)
from src/lib/onboard.ts that returns whether onboarding is Docker-driver-only,
and update the test/e2e/test-overlayfs-autofix.sh to call that helper at runtime
(for example via node -e
'console.log(require("src/lib/onboard").isDockerDriverOnlyOnboarding())') and
base the skip on the helper's true/false result instead of grepping for the
literal 'return platform === "linux";'.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 52c5ffcc-4502-483f-8d03-83db3a16ca42
📒 Files selected for processing (18)
.github/actions/resolve-hermes-base-image/action.yaml.github/actions/resolve-sandbox-base-image/action.yamlDockerfileDockerfile.baseagents/hermes/Dockerfile.basesrc/lib/agent-onboard.tssrc/lib/onboard.tssrc/lib/sandbox-base-image.tssrc/lib/shields.tssrc/lib/snapshot-action.tstest/e2e/test-double-onboard.shtest/e2e/test-overlayfs-autofix.shtest/e2e/test-sandbox-survival.shtest/hermes-share-mount-deps.test.tstest/install-preflight.test.tstest/onboard.test.tstest/runner.test.tstest/sandbox-provisioning.test.ts
✅ Files skipped from review due to trivial changes (1)
- test/hermes-share-mount-deps.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- Dockerfile.base
- Dockerfile
- test/sandbox-provisioning.test.ts
| parsed.filesystem_policy = parsed.filesystem_policy || {}; | ||
| const fsPolicy = parsed.filesystem_policy; | ||
| fsPolicy.read_only = Array.isArray(fsPolicy.read_only) | ||
| ? fsPolicy.read_only.filter((entry: unknown) => String(entry) !== "/proc") | ||
| : []; | ||
| const readWrite = Array.isArray(fsPolicy.read_write) | ||
| ? fsPolicy.read_write.map((entry: unknown) => String(entry)) | ||
| : []; | ||
| if (!readWrite.includes("/proc")) { | ||
| readWrite.push("/proc"); | ||
| } | ||
| fsPolicy.read_write = readWrite; |
There was a problem hiding this comment.
Keep /proc read-only except for the exact GPU path you need.
This change flips the whole /proc tree from read-only to read-write for every direct-GPU sandbox. The proof here only needs /proc/self/task/<tid>/comm, so widening all of procfs materially weakens the sandbox boundary for a much larger surface than required.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/onboard.ts` around lines 1365 - 1376, The code currently moves /proc
into fsPolicy.read_write which opens the entire procfs; instead, keep /proc in
read-only and only add the minimal required proc path to fsPolicy.read_write
(e.g., the exact GPU-related comm path such as `/proc/self/task/<tid>/comm` or a
single specific GPU path computed by the onboarding logic). Concretely, update
the block that populates fsPolicy.read_write (the readWrite variable and the
conditional that pushes "/proc") to not push "/proc" and to push only the
precise path string you need (compute it where you have the tid/GPU id), leaving
parsed.filesystem_policy and fsPolicy.read_only handling unchanged. Ensure the
resulting fsPolicy.read_write contains only that single allowed proc entry and
not the whole "/proc".
| const existingPid = getDockerDriverGatewayPid() ?? portListenerPid; | ||
| if (existingPid !== null && isPidAlive(existingPid)) { | ||
| if (!isDockerDriverGatewayProcess(existingPid, gatewayBin)) { | ||
| fs.rmSync(getDockerDriverGatewayPidFile(), { force: true }); | ||
| } else { | ||
| console.log(` Restarting unhealthy Docker-driver gateway process (PID ${existingPid})...`); | ||
| try { | ||
| process.kill(existingPid, "SIGTERM"); | ||
| sleep(1); | ||
| } catch { | ||
| /* best effort; the new process will surface any remaining port conflict */ | ||
| } | ||
| } |
There was a problem hiding this comment.
Only SIGTERM processes proven to be the Docker-driver gateway.
This restart branch checks isDockerDriverGatewayProcess(existingPid, gatewayBin) without requireDockerDriverEnv: true, so any other openshell-gateway process using the same binary can be killed here. On hosts that also have a packaged or legacy gateway service, onboarding can terminate the wrong process.
Suggested fix
- if (existingPid !== null && isPidAlive(existingPid)) {
- if (!isDockerDriverGatewayProcess(existingPid, gatewayBin)) {
+ if (existingPid !== null && isPidAlive(existingPid)) {
+ if (
+ !isDockerDriverGatewayProcess(existingPid, gatewayBin, {
+ requireDockerDriverEnv: true,
+ })
+ ) {
fs.rmSync(getDockerDriverGatewayPidFile(), { force: true });
} else {
console.log(` Restarting unhealthy Docker-driver gateway process (PID ${existingPid})...`);
try {
process.kill(existingPid, "SIGTERM");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/onboard.ts` around lines 4233 - 4245, The restart branch may kill any
openshell-gateway process that matches the binary but not our runtime; update
the check so isDockerDriverGatewayProcess is invoked with
requireDockerDriverEnv: true (or an equivalent flag) when validating existingPid
before deciding to SIGTERM it or remove the pidfile; locate the logic around
getDockerDriverGatewayPid()/isPidAlive() and replace the current
isDockerDriverGatewayProcess(existingPid, gatewayBin) call with one that ensures
the gateway was started with the Docker-driver environment, leaving
getDockerDriverGatewayPidFile(), gatewayBin and the existingPid logic intact.
| function resolveDockerDriverSandboxContainer(sandboxName: string): string | null { | ||
| try { | ||
| if (registry.getSandbox?.(sandboxName)?.openshellDriver !== "docker") { | ||
| return null; | ||
| } | ||
| } catch { | ||
| return null; | ||
| } | ||
| const prefix = `openshell-${sandboxName}-`; | ||
| const exact = `openshell-${sandboxName}`; | ||
| const output = dockerCapture(["ps", "--format", "{{.Names}}"], { ignoreError: true }); | ||
| return ( | ||
| output | ||
| .split("\n") | ||
| .map((line: string) => line.trim()) | ||
| .find((name: string) => name === exact || name.startsWith(prefix)) || null | ||
| ); | ||
| } |
There was a problem hiding this comment.
Fail fast when a Docker-driver sandbox container cannot be resolved.
If openshellDriver === "docker" but resolveDockerDriverSandboxContainer() returns null, this silently falls back to kubectlExecArgv(). There is no k3s gateway in Docker-driver mode, so every chmod/chown/stat call goes through the wrong backend instead of surfacing an actionable "container not found" error.
Suggested fix
function privilegedSandboxExecArgv(sandboxName: string, cmd: string[]): string[] {
- const dockerDriverContainer = resolveDockerDriverSandboxContainer(sandboxName);
- if (dockerDriverContainer) {
- return ["exec", "--user", "root", dockerDriverContainer, ...cmd];
+ const entry = registry.getSandbox?.(sandboxName);
+ if (entry?.openshellDriver === "docker") {
+ const dockerDriverContainer = resolveDockerDriverSandboxContainer(sandboxName);
+ if (!dockerDriverContainer) {
+ throw new Error(`Docker-driver sandbox container not found for '${sandboxName}'`);
+ }
+ return ["exec", "--user", "root", dockerDriverContainer, ...cmd];
}
return kubectlExecArgv(sandboxName, cmd);
}Also applies to: 82-87
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/shields.ts` around lines 47 - 64, When the registry reports
openshellDriver === "docker" but no matching container is found, fail fast
instead of returning null: modify resolveDockerDriverSandboxContainer to throw a
clear Error (including the sandboxName) when the registry says the sandbox uses
the docker driver yet dockerCapture.find(...) yields no container; update any
duplicate/second implementation in this file the same way so callers (e.g., code
that currently falls back to kubectlExecArgv) will get an actionable "container
not found" error rather than silently using the wrong backend.
| gateway_runtime_id() { | ||
| local pid_file pid cid | ||
| pid_file="$(docker_driver_gateway_pid_file)" | ||
| if [ -f "$pid_file" ]; then | ||
| pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)" | ||
| if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then | ||
| printf 'pid:%s\n' "$pid" | ||
| return 0 | ||
| fi | ||
| fi | ||
|
|
||
| cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)" | ||
| if [ -n "$cid" ]; then | ||
| printf 'container:%s\n' "$cid" | ||
| return 0 | ||
| fi | ||
|
|
||
| return 1 | ||
| } | ||
|
|
||
| stop_gateway_runtime() { | ||
| local pid_file pid cid | ||
| openshell forward stop 18789 2>/dev/null || true | ||
| openshell gateway stop -g nemoclaw 2>/dev/null || true | ||
|
|
||
| pid_file="$(docker_driver_gateway_pid_file)" | ||
| if [ -f "$pid_file" ]; then | ||
| pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)" | ||
| if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then | ||
| kill "$pid" 2>/dev/null || true | ||
| for _ in $(seq 1 10); do | ||
| kill -0 "$pid" 2>/dev/null || break | ||
| sleep 1 | ||
| done | ||
| kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null || true | ||
| fi | ||
| fi | ||
|
|
||
| cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)" | ||
| if [ -n "$cid" ]; then | ||
| docker stop "$cid" >/dev/null 2>&1 || true | ||
| fi | ||
| } |
There was a problem hiding this comment.
Don’t kill arbitrary PIDs from a stale gateway pid file.
This helper assumes any live PID in openshell-gateway.pid is still the Docker-driver gateway. If that PID has been reused, gateway_runtime_id() will misidentify the runtime and stop_gateway_runtime() can terminate an unrelated process on the host.
Suggested hardening
+is_docker_driver_gateway_pid() {
+ local pid="$1"
+ [ -n "$pid" ] || return 1
+ kill -0 "$pid" 2>/dev/null || return 1
+ ps -p "$pid" -o command= 2>/dev/null | grep -Fq 'openshell-gateway'
+}
+
gateway_runtime_id() {
local pid_file pid cid
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
- if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+ if is_docker_driver_gateway_pid "$pid"; then
printf 'pid:%s\n' "$pid"
return 0
fi
fi
@@
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
- if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+ if is_docker_driver_gateway_pid "$pid"; then
kill "$pid" 2>/dev/null || true
for _ in $(seq 1 10); do
kill -0 "$pid" 2>/dev/null || break🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/test-double-onboard.sh` around lines 82 - 124, The current logic in
gateway_runtime_id() and stop_gateway_runtime() blindly trusts any live PID from
docker_driver_gateway_pid_file and may kill an unrelated reused PID; change both
functions to verify the PID belongs to the gateway process before acting: when
you read the pid from docker_driver_gateway_pid_file in gateway_runtime_id() and
stop_gateway_runtime(), inspect the process command (e.g., /proc/<pid>/cmdline
or ps output) to ensure it contains the expected gateway/driver marker (like the
openshell gateway/driver binary name or known invocation args) and only then
return or attempt to kill it; if the PID check fails, fall back to detecting the
Docker container (docker ps ... name=openshell-cluster-nemoclaw) and avoid
issuing kill on unverified PIDs. Ensure you reference
docker_driver_gateway_pid_file, gateway_runtime_id, and stop_gateway_runtime
when making the change.
| gateway_runtime_id() { | ||
| local pid_file pid cid | ||
| pid_file="$(docker_driver_gateway_pid_file)" | ||
| if [ -f "$pid_file" ]; then | ||
| pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)" | ||
| if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then | ||
| printf 'pid:%s\n' "$pid" | ||
| return 0 | ||
| fi | ||
| fi | ||
|
|
||
| cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)" | ||
| if [ -n "$cid" ]; then | ||
| printf 'container:%s\n' "$cid" | ||
| return 0 | ||
| fi | ||
|
|
||
| return 1 | ||
| } | ||
|
|
||
| stop_gateway_runtime() { | ||
| local pid_file pid cid | ||
| openshell forward stop 18789 2>/dev/null || true | ||
| openshell gateway stop -g nemoclaw 2>/dev/null || true | ||
|
|
||
| pid_file="$(docker_driver_gateway_pid_file)" | ||
| if [ -f "$pid_file" ]; then | ||
| pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)" | ||
| if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then | ||
| kill "$pid" 2>/dev/null || true | ||
| for _ in $(seq 1 10); do | ||
| kill -0 "$pid" 2>/dev/null || break | ||
| sleep 1 | ||
| done | ||
| kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null || true | ||
| fi | ||
| fi | ||
|
|
||
| cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)" | ||
| if [ -n "$cid" ]; then | ||
| docker stop "$cid" >/dev/null 2>&1 || true | ||
| fi | ||
| } |
There was a problem hiding this comment.
Validate the PID before treating it as the gateway.
This helper trusts any live PID found in openshell-gateway.pid. If that file is stale and the PID has been recycled, gateway_runtime_id() will report the wrong runtime and stop_gateway_runtime() can send TERM/KILL to an unrelated host process.
Suggested hardening
+is_docker_driver_gateway_pid() {
+ local pid="$1"
+ [ -n "$pid" ] || return 1
+ kill -0 "$pid" 2>/dev/null || return 1
+ ps -p "$pid" -o command= 2>/dev/null | grep -Fq 'openshell-gateway'
+}
+
gateway_runtime_id() {
local pid_file pid cid
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
- if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+ if is_docker_driver_gateway_pid "$pid"; then
printf 'pid:%s\n' "$pid"
return 0
fi
fi
@@
pid_file="$(docker_driver_gateway_pid_file)"
if [ -f "$pid_file" ]; then
pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)"
- if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+ if is_docker_driver_gateway_pid "$pid"; then
kill "$pid" 2>/dev/null || true
for _ in $(seq 1 10); do
kill -0 "$pid" 2>/dev/null || break📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| gateway_runtime_id() { | |
| local pid_file pid cid | |
| pid_file="$(docker_driver_gateway_pid_file)" | |
| if [ -f "$pid_file" ]; then | |
| pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)" | |
| if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then | |
| printf 'pid:%s\n' "$pid" | |
| return 0 | |
| fi | |
| fi | |
| cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)" | |
| if [ -n "$cid" ]; then | |
| printf 'container:%s\n' "$cid" | |
| return 0 | |
| fi | |
| return 1 | |
| } | |
| stop_gateway_runtime() { | |
| local pid_file pid cid | |
| openshell forward stop 18789 2>/dev/null || true | |
| openshell gateway stop -g nemoclaw 2>/dev/null || true | |
| pid_file="$(docker_driver_gateway_pid_file)" | |
| if [ -f "$pid_file" ]; then | |
| pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)" | |
| if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then | |
| kill "$pid" 2>/dev/null || true | |
| for _ in $(seq 1 10); do | |
| kill -0 "$pid" 2>/dev/null || break | |
| sleep 1 | |
| done | |
| kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null || true | |
| fi | |
| fi | |
| cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)" | |
| if [ -n "$cid" ]; then | |
| docker stop "$cid" >/dev/null 2>&1 || true | |
| fi | |
| } | |
| is_docker_driver_gateway_pid() { | |
| local pid="$1" | |
| [ -n "$pid" ] || return 1 | |
| kill -0 "$pid" 2>/dev/null || return 1 | |
| ps -p "$pid" -o command= 2>/dev/null | grep -Fq 'openshell-gateway' | |
| } | |
| gateway_runtime_id() { | |
| local pid_file pid cid | |
| pid_file="$(docker_driver_gateway_pid_file)" | |
| if [ -f "$pid_file" ]; then | |
| pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)" | |
| if is_docker_driver_gateway_pid "$pid"; then | |
| printf 'pid:%s\n' "$pid" | |
| return 0 | |
| fi | |
| fi | |
| cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)" | |
| if [ -n "$cid" ]; then | |
| printf 'container:%s\n' "$cid" | |
| return 0 | |
| fi | |
| return 1 | |
| } | |
| stop_gateway_runtime() { | |
| local pid_file pid cid | |
| openshell forward stop 18789 2>/dev/null || true | |
| openshell gateway stop -g nemoclaw 2>/dev/null || true | |
| pid_file="$(docker_driver_gateway_pid_file)" | |
| if [ -f "$pid_file" ]; then | |
| pid="$(tr -d '[:space:]' <"$pid_file" 2>/dev/null || true)" | |
| if is_docker_driver_gateway_pid "$pid"; then | |
| kill "$pid" 2>/dev/null || true | |
| for _ in $(seq 1 10); do | |
| kill -0 "$pid" 2>/dev/null || break | |
| sleep 1 | |
| done | |
| kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null || true | |
| fi | |
| fi | |
| cid="$(docker ps -qf "name=openshell-cluster-nemoclaw" 2>/dev/null | head -1)" | |
| if [ -n "$cid" ]; then | |
| docker stop "$cid" >/dev/null 2>&1 || true | |
| fi | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/test-sandbox-survival.sh` around lines 130 - 172, The helper blindly
trusts any live PID in docker_driver_gateway_pid_file; update
gateway_runtime_id() and stop_gateway_runtime() to validate the PID belongs to
the gateway process before acting: after reading pid from
docker_driver_gateway_pid_file, check the process command line (e.g.
/proc/$pid/cmdline or ps -p "$pid" -o args=) for an expected marker such as
"openshell gateway" or the specific gateway args used by your runtime, and only
return/kill that pid if the marker matches; if the marker does not match, ignore
the pid and fall back to checking the docker container ID as currently done (and
apply the same validation in stop_gateway_runtime() before sending TERM/KILL).
| start_gateway_runtime() { | ||
| local previous_runtime="$1" | ||
| if [[ "$previous_runtime" == pid:* ]]; then | ||
| local recovery_log | ||
| recovery_log="$(mktemp)" | ||
| if nemoclaw "$SANDBOX_NAME" status >"$recovery_log" 2>&1; then | ||
| pass "Gateway recovered through NemoClaw status" | ||
| else | ||
| info "NemoClaw status recovery returned non-zero; polling gateway health" | ||
| sed 's/^/ /' "$recovery_log" | tail -40 || true | ||
| fi | ||
| rm -f "$recovery_log" | ||
| return 0 |
There was a problem hiding this comment.
Put a timeout around the recovery nemoclaw status call.
This branch can hang indefinitely if status hits the same SSH/forward recovery issue you already defend against in Phase 7. When that happens, Phase 6 never reaches the health poll and the whole E2E run burns its outer timeout instead of failing fast.
Suggested fix
start_gateway_runtime() {
local previous_runtime="$1"
if [[ "$previous_runtime" == pid:* ]]; then
- local recovery_log
+ local recovery_log recovery_exit
recovery_log="$(mktemp)"
- if nemoclaw "$SANDBOX_NAME" status >"$recovery_log" 2>&1; then
+ if run_with_timeout 120 nemoclaw "$SANDBOX_NAME" status >"$recovery_log" 2>&1; then
pass "Gateway recovered through NemoClaw status"
else
+ recovery_exit=$?
+ if [ "$recovery_exit" -eq 124 ]; then
+ fail "Gateway recovery via NemoClaw status timed out"
+ sed 's/^/ /' "$recovery_log" | tail -40 || true
+ rm -f "$recovery_log"
+ return 1
+ fi
info "NemoClaw status recovery returned non-zero; polling gateway health"
sed 's/^/ /' "$recovery_log" | tail -40 || true
fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/test-sandbox-survival.sh` around lines 174 - 186, The recovery
branch in start_gateway_runtime runs "nemoclaw \"$SANDBOX_NAME\" status" with no
timeout and can hang; wrap that call with a bounded timeout (e.g., use the
timeout command or a background timer) so the command is killed after a short
limit, capture its stdout/stderr into the existing recovery_log, and treat a
timeout as a non-zero failure so the code falls through to the health-polling
path; ensure you still print the last lines of recovery_log (sed/tail) on
failure and always rm -f "$recovery_log" before returning.
Selective E2E Results — ❌ Some jobs failedRun: 25349684174
|
Selective E2E Results — ✅ All requested jobs passedRun: 25350806104
|
…r-gpu-onboard # Conflicts: # test/install-preflight.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 25351590187
|
Selective E2E Results — ❌ Some jobs failedRun: 25592785835
|
Selective E2E Results — ❌ Some jobs failedRun: 25593021519
|
Selective E2E Results — ❌ Some jobs failedRun: 25593245651
|
Selective E2E Results — ✅ All requested jobs passedRun: 25593514742
|
Selective E2E Results — ✅ All requested jobs passedRun: 25593719800
|
Selective E2E Results — ✅ All requested jobs passedRun: 25631424645
|
Selective E2E Results — ✅ All requested jobs passedRun: 25639847344
|
Selective E2E Results — ❌ Some jobs failedRun: 25679161251
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 25682068629
|
Selective E2E Results — ✅ All requested jobs passedRun: 25683897637
|
## Coverage guard for #3111 — "Docker-driver gateway is healthy" false-positive This PR adds an E2E regression test that fails on `main` today. It is intentional that **this test will be red on `main` and nightly will go red** until the fix for #3111 lands. ### The gap Issue #3111 reports that on Ubuntu 22.04 the onboard flow prints: ``` Starting OpenShell Docker-driver gateway... Gateway log: ~/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log ✓ Docker-driver gateway is healthy ← false positive ``` while the gateway log shows the binary never actually ran: ``` openshell-gateway: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found openshell-gateway: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found ``` The underlying NemoClaw bug is **platform-independent**: - `startGateway()` (`src/lib/onboard.ts:5500+`) spawns the gateway binary with `detached: true` and `child.unref()`. When the binary crashes, the detached child becomes a **zombie**, and `isPidAlive()` (`process.kill(pid, 0)`) returns true for zombies — so the poll loop doesn't break. - `registerDockerDriverGatewayEndpoint()` (`src/lib/onboard.ts:4347`) is **metadata-only**: `openshell gateway add --local --name nemoclaw <url>` writes the endpoint to the config; it does NOT probe the endpoint. - `isGatewayHealthy()` (`src/lib/state/gateway.ts:99`) is a **string match on `openshell status` and `openshell gateway info` output**, not a live health check. Result: on any Linux host where the gateway binary fails to start for any reason (GLIBC mismatch, missing shared lib, permissions, OOM, CDI-spec error, corrupted binary…), onboard reports `✓ Docker-driver gateway is healthy` and proceeds to the next onboard step, which then fails with a confusing `Connection refused` downstream. There is an existing `openshell-gateway-upgrade-e2e` test covering the **stale-gateway-replaced** path for PR #3001, but no test covers the **gateway-binary-crashes** path that is the root issue in #3111. ### What this test does `test/e2e/test-gateway-health-honest.sh`: 1. Installs the real `openshell` + `openshell-gateway` binaries via `scripts/install-openshell.sh` (same setup path as the existing upgrade test). 2. Drops a sabotage shim at `$STATE_DIR/openshell-gateway-sabotage` that exits immediately with the same GLIBC-style stderr reported in #3111. 3. Invokes `startGateway(null)` via a Node heredoc, with `NEMOCLAW_OPENSHELL_GATEWAY_BIN` pointing at the shim. 4. Asserts: - **Primary:** the onboard output does NOT contain `"Docker-driver gateway is healthy"`. - **Corroborating:** the node process exits non-zero (≠ 0). - **Corroborating:** onboard surfaces a user-visible failure line (`failed to start`, `crash`, `exit`, `not found`, or a thrown exception). - **Corroborating:** no live non-zombie gateway process remains after the simulated crash. The test runs on `ubuntu-latest` and does not require an Ubuntu 22.04 runner — it exercises the NemoClaw-side bug class, not the OpenShell-side GLIBC packaging choice. The GLIBC compatibility concern is an OpenShell team issue and is out of scope for this coverage guard. ### Expected CI behavior | Ref | Nightly job `gateway-health-honest-e2e` | |---|---| | **this PR** | PR checks don't run nightly, so this PR's CI is green (modulo the usual unit/lint suite). | | **main (after merge)** | **FAILS.** Primary assertion trips: onboard logs `✓ Docker-driver gateway is healthy`. | | **any branch that fixes #3111** | PASSES. | ### The red-nightly tradeoff Once merged, the nightly badge will go red on `gateway-health-honest-e2e` until #3111 is fixed. That is the point — the failing test is the executable acceptance criterion for the fix. A subsequent PR authored via `/skill:nemoclaw-issue-kickoff 3111` will produce the fix with this test as its definition-of-done. ### Expected failure output on main ``` [FAIL] Onboard reported '✓ Docker-driver gateway is healthy' although the gateway binary crashed on startup (#3111 false-positive health check) [DIAG] start log tail: Starting OpenShell Docker-driver gateway... Gateway log: /home/runner/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log ✓ Docker-driver gateway is healthy __onboard_startGateway_returned_successfully__ [DIAG] gateway log tail: openshell-gateway-sabotage: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found openshell-gateway-sabotage: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found ``` ### Wiring - New file: `test/e2e/test-gateway-health-honest.sh` (~170 LOC, modeled after `test/e2e/test-openshell-gateway-upgrade.sh`) - New job: `gateway-health-honest-e2e` in `.github/workflows/nightly-e2e.yaml` (6 edits: comment, inputs.jobs description, new job block, 3 needs arrays) ### References - Issue #3111 (NV QA / UAT / Platform: Brev / Platform: Ubuntu / Docker / bug) - NVB#6150133 - PR #3001 (merged; introduced the affected code path) - Existing test being modeled: `test/e2e/test-openshell-gateway-upgrade.sh` - Relevant source: - `src/lib/onboard.ts` — `startGateway`, `startGatewayWithOptions`, `isPidAlive`, `registerDockerDriverGatewayEndpoint` - `src/lib/state/gateway.ts` — `isGatewayHealthy` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added an end-to-end test that verifies gateway start failure is detected, exposes user-visible error output, and ensures no lingering gateway processes remain. * **Chores** * Integrated the new test into the nightly pipeline as a selectable job, including artifact upload on failure and inclusion in failure reporting, PR comments, and the nightly scorecard. * **Documentation** * Updated nightly workflow job selection documentation and added a review-tool entry linking the test to the nightly job. [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3362) [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3362) <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Refreshes the release-prep docs for v0.0.39 based on changes merged since the Friday 4pm doc refresh. Updates the source docs, bumps the docs version metadata, and regenerates the NemoClaw user skills from the refreshed docs. ## Changes - #3314 -> `docs/get-started/prerequisites.md`, `docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`: Documents installer Docker setup, Docker group activation, and retry guidance. - #3317 -> `docs/get-started/quickstart.md`, `docs/reference/commands.md`: Documents the DGX Spark and DGX Station express install prompt and `NEMOCLAW_NO_EXPRESS`. - #3328 and #3329 -> `docs/security/best-practices.md`, `docs/deployment/sandbox-hardening.md`: Updates sandbox capability hardening docs for the stricter bounding-set and `setpriv` step-down behavior. - #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`: Documents Windows-host Ollama relaunch behavior, NIM key passthrough, early health-fail diagnostics, and mixed-GPU preflight detail. - #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 -> `docs/about/release-notes.md`: Adds the v0.0.39 release-prep section while keeping the v0.0.38 release notes intact. - Advances the release-prep docs metadata from v0.0.38 to v0.0.39. - Regenerates `.agents/skills/nemoclaw-user-*` from the updated source docs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes v0.0.39 * **New Features** * Host alias management commands for easier configuration * Sandbox GPU control options during onboarding * Update command with check and confirmation modes * **Documentation** * Enhanced Linux installer guidance with Docker and group membership handling * Expanded troubleshooting for permission and connectivity issues * Improved capability-dropping security documentation * Updated inference model switching commands * Brev environment-specific troubleshooting * **Improvements** * DGX Spark/Station express install flow * Windows Ollama relay and health-check enhancements * NVIDIA NIM preflight GPU reporting [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3375) <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
nvidia-smi,/proccomm write, andcuInit(0)Tests
bash -n scripts/install-openshell.sh && bash -n test/e2e/test-gpu-e2e.shnpm run build:clinpm run typecheck:clinpx vitest run test/onboard.test.ts src/lib/onboard-command.test.ts test/install-openshell-version-check.test.ts src/lib/inventory-commands.test.tsnpm run check:credential-envgit diff --checkSummary by CodeRabbit
New Features
Improvements
Tests