Skip to content

fix(recovery): add connect probe recovery path#2646

Merged
ericksoa merged 25 commits into
mainfrom
fix/gateway-recovery-probe-observability
May 2, 2026
Merged

fix(recovery): add connect probe recovery path#2646
ericksoa merged 25 commits into
mainfrom
fix/gateway-recovery-probe-observability

Conversation

@ericksoa
Copy link
Copy Markdown
Contributor

@ericksoa ericksoa commented Apr 28, 2026

Summary

  • adds nemoclaw <sandbox> connect --probe-only as a non-interactive gateway recovery probe, plus connect --help handling that does not require registry or OpenShell recovery
  • runs gateway health and recovery through openshell sandbox exec first, with SSH fallback, so recovery uses the same sandbox exec path as the adversarial tamper probe
  • centralizes OpenClaw recovery script generation and prepares /tmp/gateway.log as the real gateway-owned log before writing [gateway-recovery] WARNING and relaunching via gosu gateway when available
  • updates the [DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox #2478 e2e to use connect --probe-only, assert the warning, and verify a negative-case respawn

Verification

  • npm run build:cli
  • npx vitest run src/lib/agent-runtime.test.ts test/nemoclaw-cli-recovery.test.ts test/cli.test.ts -t "connect --probe-only|shows connect help|buildRecoveryScript|nemoclaw CLI runtime recovery"
  • npm run typecheck:cli
  • tmp_home=$(mktemp -d); HOME="$tmp_home" node bin/nemoclaw.js alpha connect --help
  • git diff --check
  • npx prettier --check src/nemoclaw.ts src/lib/agent-runtime.ts src/lib/agent-runtime.test.ts src/lib/command-registry.ts test/cli.test.ts
  • npx eslint src/nemoclaw.ts src/lib/agent-runtime.ts src/lib/agent-runtime.test.ts src/lib/command-registry.ts test/cli.test.ts (exit 0; TS files ignored by local ESLint config warnings)

Local full-hook notes

  • npm run format:check -- ... still runs the repo-wide configured globs and fails on pre-existing formatting issues in unrelated test files. Direct Prettier check on touched files passes.
  • The full pre-commit/pre-push CLI test hook times out locally in unrelated tests: test/onboard.test.ts and test/sandbox-build-context.test.ts; after building nemoclaw/dist, the focused regression tests above pass.
  • Initial SSH push was blocked by NVIDIA SAML for the local SSH key; branch was pushed through the authenticated HTTPS GitHub CLI path.

Summary by CodeRabbit

  • New Features

    • Added --probe-only connect mode, implicit connect handling, and OpenClaw-aware recovery with safer gateway start/log selection.
  • Bug Fixes

    • Hardened gateway log preparation (no symlink-following, append redirection, fd-safe permission/ownership) and deterministic exec-based sandbox probing with SSH fallback.
    • Simplified/stricter agent-binary verification and improved cross-platform config-owner detection.
  • Chores

    • Expanded tests and e2e scripts; updated permission-mode expectations and added setgid handling for locking.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Centralizes hardened gateway log setup (no-follow FD init and selection), extracts gateway launch wiring, adds exec-based sandbox probing and a --probe-only connect mode, exports buildOpenClawRecoveryScript(port), and updates CLI, probes, and tests to exercise exec-first recovery with SSH fallback.

Changes

OpenClaw recovery, exec probing, and probe-only connect

Layer / File(s) Summary
Data / Constants
src/nemoclaw.ts
Adds SANDBOX_EXEC_STARTED_MARKER constant to demarcate exec probe stdout.
Probe Implementation
src/nemoclaw.ts
Adds executeSandboxExecCommand() (exec probe that expects marker) and parseSandboxGatewayProbe(); exec returns null when marker absent; SSH probe returns null when ssh-config output is empty.
Recovery Control Flow
src/nemoclaw.ts
isSandboxGatewayRunning() prefers exec probe and falls back to SSH only when exec is unparseable; recoverSandboxProcesses() uses exec-first recovery for OpenClaw (30s timeout, marker-driven) and SSH-only for non-OpenClaw.
Recovery Script Generation
src/lib/agent-runtime.ts
Adds buildNoFollowLogSetupCommand, buildGatewayLogSetup, buildGatewayLogSelection, and gatewayLaunchCommand; introduces exported buildOpenClawRecoveryScript(port) and refactors buildRecoveryScript to reuse shared log setup/selection and launch wiring.
CLI Connect Integration
src/nemoclaw.ts, src/lib/command-registry.ts
Adds --probe-only parsing/helpers (SandboxConnectOptions, parseSandboxConnectArgs, printSandboxConnectHelp), runSandboxConnectProbe, wires probeOnly into sandboxConnect, treats leading --probe-only/--help/-h as implicit connect, and documents --probe-only in COMMANDS.
Tests: Recovery & Log Expectations
src/lib/agent-runtime.test.ts, test/cli.test.ts, test/e2e/test-issue-2478-crash-loop-recovery.sh
Tests updated to assert no-follow log initialization (O_NOFOLLOW), fd-based os.open/os.fchown/os.fchmod and gosu usage, _GATEWAY_LOG selection/fallback, fail-fast on log-setup failure, append redirection to $_GATEWAY_LOG, and probe-only exec/SSH fallback behaviors.

Shields, tooling, and test harness updates

Layer / File(s) Summary
Shields / Permission Mode Changes
src/lib/shields.ts, test/e2e/test-shields-config.sh, test/shields.test.ts
applyStateDirLockMode and lockAgentConfig now attempt chmod g-s (strip setgid) when locking; tests and e2e scripts updated to accept/expect group-writable mutable defaults (660/2770).
Test harness & mocks
test/repro-2681-group-writable.test.ts, test/cli.test.ts, test/nemoclaw-start.test.ts
Adds withMockedDockerExecFileSync() using vi.fn for in-process mocking, extends writeSandboxRegistry to accept sandboxOverrides, adjusts readiness polling to be kind-dependent, and updates multiple test stubs and expectations.
Shell portability
scripts/nemoclaw-start.sh
normalize_mutable_config_perms() queries owner using stat -c '%U' with a fallback to stat -f '%Su' for BSD/macOS compatibility.
Dockerfile parsing helper
test/fetch-guard-patch-regression.test.ts
dockerRunCommandBetween() now collapses \\\n continuations into spaces and trims trailing continuation backslashes when extracting RUN content.
Agent onboarding verifier
src/lib/agent-onboard.ts, src/lib/agent-onboard.test.ts
Exports verifyAgentBinaryAvailable(...); sandbox-side checks now verify -e and -x for configured binary_path and removes path_mismatch handling; tests added/updated.
Small CLI metadata
src/lib/command-registry.ts
Documents --probe-only flag for nemoclaw <name> connect in the command registry.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI
    participant RecoverFn
    participant ExecProbe
    participant SSHProbe
    participant Agent

    User->>CLI: nemoclaw alpha connect --probe-only
    CLI->>RecoverFn: runSandboxConnectProbe(quiet:true)
    RecoverFn->>ExecProbe: isSandboxGatewayRunning() via sandbox exec
    alt exec probe usable
        ExecProbe->>Agent: sandbox exec probe (marker-prefixed)
        Agent-->>ExecProbe: marker + RUNNING/STOPPED
        ExecProbe-->>RecoverFn: parsed result
    else exec probe unparseable (null)
        RecoverFn->>RecoverFn: attempt exec-based recovery (buildOpenClawRecoveryScript)
        RecoverFn->>Agent: sandbox exec run recovery script (log setup, launch)
        Agent-->>RecoverFn: script outcome
        alt exec recovery returned null / unparseable
            RecoverFn->>SSHProbe: fallback to SSH recovery (ssh-config + ssh)
            SSHProbe->>Agent: run recovery script over SSH
            Agent-->>SSHProbe: probe/marker or output
        end
    end
    RecoverFn-->>CLI: success/failure
    CLI-->>User: exit code
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

priority: high

Suggested reviewers

  • jyaunches

Poem

🐰 I nudge the logs with careful paws,

Exec-first probes and guarded laws,
No symlink snares, permissions tamed,
Gateway wakes and crashes maimed,
A happy rabbit hops—recovery praised!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(recovery): add connect probe recovery path' directly describes the main addition to the codebase: a new recovery probe feature via nemoclaw connect --probe-only. It is specific, concise, and accurately reflects the primary objective of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/gateway-recovery-probe-observability

Comment @coderabbitai help to get the list of available commands and usage tips.

@ericksoa ericksoa self-assigned this Apr 28, 2026
@ericksoa ericksoa added bug Something isn't working Sandbox Use this label to identify issues related to the NemoClaw isolated environment based on OpenShell. v0.0.29 Release target labels Apr 28, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 51-56: The buildGatewayLogSetup helper always assumes the
"gateway" runtime user and chowns /tmp/gateway.log and later uses gosu gateway,
which incorrectly changes UID for non-OpenClaw recovery flows; modify the call
sites so buildGatewayLogSetup no longer hardcodes the user: update
buildGatewayLogSetup to accept a user parameter (e.g., logOwnerUser) and only
pass "gateway" from buildOpenClawRecoveryScript(), while leaving
buildRecoveryScript() to pass the explicit recovery user (or null/empty to skip
chown/gosu). Search for buildGatewayLogSetup, buildRecoveryScript, and
buildOpenClawRecoveryScript to update signatures and call sites and ensure the
chown/gosu steps only run when a non-empty user is provided.

In `@src/nemoclaw.ts`:
- Around line 1430-1451: The parser currently accepts both "--probe-only" and
"--dangerously-skip-permissions" but probeOnly path skips applying permissions,
so update parseSandboxConnectArgs to detect the conflicting combination
(options.probeOnly && options.dangerouslySkipPermissions) and fail fast: print a
clear error message, call printSandboxConnectHelp(sandboxName), and
process.exit(1); alternatively, if you prefer to honor the permission change,
perform the permission switch before any early return in the code path that
handles probeOnly; reference parseSandboxConnectArgs, the option flags
"--probe-only" and "--dangerously-skip-permissions", and printSandboxConnectHelp
when implementing the change.
- Around line 3953-3962: The help-detection only triggers when
requestedSandboxAction === "connect", so "nemoclaw <sandbox> --help" (where
args[0] is the sandbox name and args[1] is "--help") falls through; update the
condition that calls validateName(cmd, "sandbox name") and
printSandboxConnectHelp(cmd) to also detect help flags when they appear after a
sandbox name by checking requestedSandboxActionArgs (e.g., if
requestedSandboxAction === "connect" || requestedSandboxActionArgs.some(arg =>
arg === "--help" || arg === "-h")), so that the help path runs for both explicit
"connect" and the implicit "<sandbox> --help" cases while still using the
existing validateName and printSandboxConnectHelp calls.
- Around line 324-326: The current logic calls the SSH fallback whenever
recovered(execResult) is false, which can re-run the same script even when
execResult is non-null (i.e., the script ran and failed); change the flow in the
block using executeSandboxExecCommand, execResult and executeSandboxCommand so
that: 1) if recovered(execResult) return true; 2) if execResult !== null return
false (do not fallback to SSH because the exec completed); 3) otherwise
(execResult is null/indeterminate) call
recovered(executeSandboxCommand(sandboxName, script)) to attempt the SSH
fallback. Ensure you reference executeSandboxExecCommand, execResult, recovered
and executeSandboxCommand when making the change.

In `@test/cli.test.ts`:
- Around line 1085-1093: Update the test that calls runWithEnv("alpha connect
--help", ...) to stub any external "openshell" invocation: create a temporary
directory, place a small executable named "openshell" there that fails/records
invocations (e.g., exits non‑zero or writes a marker file), prepend that temp
dir to PATH in the environment passed to runWithEnv, run the command, then
assert r.code === 0 and r.out contains the help text and additionally assert
that the stubbed "openshell" was never invoked (e.g., marker file does not exist
or the stub was not executed); reference the existing test block (the it(...)
titled "shows connect help without opening an interactive session"), runWithEnv,
and the "openshell" executable name when making changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3a57982b-f8a0-4085-8795-6f52b2d38f23

📥 Commits

Reviewing files that changed from the base of the PR and between 1f228a5 and c36db00.

📒 Files selected for processing (6)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/lib/command-registry.ts
  • src/nemoclaw.ts
  • test/cli.test.ts
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

Comment thread src/lib/agent-runtime.ts Outdated
Comment thread src/nemoclaw.ts Outdated
Comment thread src/nemoclaw.ts
Comment thread src/nemoclaw.ts Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/cli.test.ts`:
- Around line 1153-1165: The case branch matcher for the recovery script
currently looks for an escaped fragment (OPENCLAW=\"$(command -v openclaw)\"),
which never matches the generated script that uses OPENCLAW="$(command -v
openclaw)"; update the case pattern (the line matching OPENCLAW in the stub) so
it matches the actual emitted string (e.g., remove the unnecessary
backslash-escaping around the quotes or otherwise match OPENCLAW="$(command -v
openclaw)"), ensuring the recovery branch that writes to state_file and echoes
GATEWAY_PID runs during the probe-only success test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ffe40201-a783-4702-b927-7030e82cd218

📥 Commits

Reviewing files that changed from the base of the PR and between c36db00 and 5b47a6b.

📒 Files selected for processing (4)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
  • test/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/agent-runtime.test.ts
  • src/nemoclaw.ts

Comment thread test/cli.test.ts Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 51-58: The current buildGatewayLogSetup truncates and chowns
/tmp/gateway.log before checking for symlinks, allowing a sandbox user to
preplace a symlink and cause clobber/permission changes; update
buildGatewayLogSetup to either use a root-owned path (e.g., /var/log/gateway.log
or another root-controlled directory) instead of /tmp, or add an explicit
symlink rejection before any truncation/chown (for example test -L
/tmp/gateway.log || test -h /tmp/gateway.log and bail with an error if true),
and only proceed to ": > …", touch, chown, and chmod after verifying the file is
not a symlink; ensure the chmodGatewayLog and the initial ": > /tmp/gateway.log
..." line in buildGatewayLogSetup are changed to perform the symlink check first
or use the new root-owned path.

In `@src/nemoclaw.ts`:
- Around line 278-282: The early check returning on result.error discards stdout
and can miss SANDBOX_EXEC_STARTED_MARKER; in the function handling the openshell
sandbox exec result (look for variables result, stdout, stdoutLines and the
constant SANDBOX_EXEC_STARTED_MARKER and callers like recoverSandboxProcesses),
parse and trim result.stdout, split into stdoutLines and check for the
SANDBOX_EXEC_STARTED_MARKER first, returning null only if the marker is absent,
and then check result.error afterwards so marker detection is preserved even if
the command later errors or times out.

In `@test/e2e/test-issue-2478-crash-loop-recovery.sh`:
- Line 445: The test currently masks failures of the probe command by appending
"|| true" to the timeout 60 nemoclaw "$SANDBOX_NAME" connect --probe-only
invocation (also at the other two occurrences), which hides regressions; remove
the "|| true", run the command capturing stdout/stderr into a variable or temp
file, check its exit status and if non-zero fail the test the same way as the
main crash-recovery loop, and when failing print/dump the captured probe output
so the failure is visible for debugging.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 71bf74d4-7544-4472-a3fb-27fe2c895652

📥 Commits

Reviewing files that changed from the base of the PR and between 5b47a6b and 6e4d953.

📒 Files selected for processing (5)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
  • test/cli.test.ts
  • test/e2e/test-issue-2478-crash-loop-recovery.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/agent-runtime.test.ts
  • test/cli.test.ts

Comment thread src/nemoclaw.ts Outdated
Comment thread test/e2e/test-issue-2478-crash-loop-recovery.sh Outdated
@cv cv added v0.0.30 Release target and removed v0.0.29 Release target labels Apr 29, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/nemoclaw.ts (1)

271-292: ⚠️ Potential issue | 🟠 Major

Don't collapse a marked exec to null on timeout/error.

If spawnSync() times out after emitting SANDBOX_EXEC_STARTED_MARKER, Line 282 turns that run into null, and recoverSandboxProcesses() will treat it as indeterminate and replay the recovery script over SSH.

Suggested fix
-    if (result.error) return null;
     const stdout = (result.stdout || "").trim();
     const stdoutLines = stdout.split(/\r?\n/);
     const markerIndex = stdoutLines.indexOf(SANDBOX_EXEC_STARTED_MARKER);
     if (markerIndex === -1) return null;
     const commandStdoutLines = stdoutLines.slice(markerIndex + 1);
     return {
       status: result.status ?? 1,
       stdout: commandStdoutLines.join("\n").trim(),
       stderr: (result.stderr || "").trim(),
     };

Run this to verify that spawnSync() can preserve partial stdout even when result.error is set:

#!/bin/bash
set -euo pipefail
node - <<'NODE'
const { spawnSync } = require("node:child_process");

const result = spawnSync(
  "sh",
  ["-c", "printf '__NEMOCLAW_SANDBOX_EXEC_STARTED__\\n'; sleep 2"],
  { encoding: "utf8", timeout: 200 }
);

console.log(JSON.stringify({
  hasError: Boolean(result.error),
  status: result.status,
  signal: result.signal,
  stdout: result.stdout,
  stderr: result.stderr
}, null, 2));
NODE
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 271 - 292, The current spawnSync block in
src/nemoclaw.ts returns null whenever result.error is truthy, which discards
partial stdout even if SANDBOX_EXEC_STARTED_MARKER was emitted; update the logic
around the spawnSync call (the block that checks result.error, stdout,
stdoutLines, markerIndex) to first inspect result.stdout for
SANDBOX_EXEC_STARTED_MARKER before bailing on errors/timeouts, i.e., parse
result.stdout (trim/split) and if markerIndex !== -1 proceed to build and return
the result object (status, stdout from marker+1, stderr) even when result.error
exists; only return null if markerIndex === -1 or stdout lacks the marker.
🧹 Nitpick comments (2)
src/nemoclaw.ts (2)

4372-4376: Consider normalizing implicit connect flags, not just help.

You special-case -h/--help as implicit connect, but nemoclaw <sandbox> --probe-only still goes through the generic action switch and becomes Unknown action. Routing leading flag-only invocations through parseSandboxConnectArgs() would make the implicit-connect UX consistent.

Also applies to: 4409-4415

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 4372 - 4376, The code currently treats only
-h/--help as an implicit "connect" by using firstSandboxArg and
implicitConnectHelp; change the logic so any leading flag (argument starting
with "-") is considered an implicit connect action and its args are routed into
parseSandboxConnectArgs() instead of the generic action switch. Update the
computation of requestedSandboxAction and requestedSandboxActionArgs (the
variables firstSandboxArg, implicitConnectHelp, requestedSandboxAction,
requestedSandboxActionArgs) to detect firstSandboxArg?.startsWith("-") (or
similar) and set action to "connect" and args to args (not args.slice(1)), then
ensure the code path calls parseSandboxConnectArgs() for that case. This keeps
the UX consistent for flag-only invocations like "--probe-only".

1571-1601: Extract the probe-only flow out of sandboxConnect().

This adds another early-return branch to one of the busiest functions in the file. Moving the probeOnly path into a dedicated helper would make the normal connect flow easier to follow and keep the function closer to the complexity budget.

As per coding guidelines, "**/*.{js,ts,tsx}: Cyclomatic complexity must not exceed 20 (with intent to ratchet down to 15) as enforced by ESLint`."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 1571 - 1601, Extract the probe-only branch
inside sandboxConnect into a new helper function (e.g., probeSandboxGateway)
that accepts (sandboxName: string) and the same options (or just probeOnly flag)
and encapsulates the logic using checkAndRecoverSandboxProcesses,
agentRuntime.getSessionAgent, agentRuntime.getAgentDisplayName and the
process.exit/log calls; then replace the inline probeOnly block in
sandboxConnect with an early return that calls the new
probeSandboxGateway(sandboxName) helper. Ensure SandboxConnectOptions
(dangerouslySkipPermissions/probeOnly) usage stays the same and that any
required requires/imports remain accessible to the new helper so behavior is
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/nemoclaw.ts`:
- Around line 271-292: The current spawnSync block in src/nemoclaw.ts returns
null whenever result.error is truthy, which discards partial stdout even if
SANDBOX_EXEC_STARTED_MARKER was emitted; update the logic around the spawnSync
call (the block that checks result.error, stdout, stdoutLines, markerIndex) to
first inspect result.stdout for SANDBOX_EXEC_STARTED_MARKER before bailing on
errors/timeouts, i.e., parse result.stdout (trim/split) and if markerIndex !==
-1 proceed to build and return the result object (status, stdout from marker+1,
stderr) even when result.error exists; only return null if markerIndex === -1 or
stdout lacks the marker.

---

Nitpick comments:
In `@src/nemoclaw.ts`:
- Around line 4372-4376: The code currently treats only -h/--help as an implicit
"connect" by using firstSandboxArg and implicitConnectHelp; change the logic so
any leading flag (argument starting with "-") is considered an implicit connect
action and its args are routed into parseSandboxConnectArgs() instead of the
generic action switch. Update the computation of requestedSandboxAction and
requestedSandboxActionArgs (the variables firstSandboxArg, implicitConnectHelp,
requestedSandboxAction, requestedSandboxActionArgs) to detect
firstSandboxArg?.startsWith("-") (or similar) and set action to "connect" and
args to args (not args.slice(1)), then ensure the code path calls
parseSandboxConnectArgs() for that case. This keeps the UX consistent for
flag-only invocations like "--probe-only".
- Around line 1571-1601: Extract the probe-only branch inside sandboxConnect
into a new helper function (e.g., probeSandboxGateway) that accepts
(sandboxName: string) and the same options (or just probeOnly flag) and
encapsulates the logic using checkAndRecoverSandboxProcesses,
agentRuntime.getSessionAgent, agentRuntime.getAgentDisplayName and the
process.exit/log calls; then replace the inline probeOnly block in
sandboxConnect with an early return that calls the new
probeSandboxGateway(sandboxName) helper. Ensure SandboxConnectOptions
(dangerouslySkipPermissions/probeOnly) usage stays the same and that any
required requires/imports remain accessible to the new helper so behavior is
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 44020cdc-064d-4ef5-aa2d-51aa9223fce9

📥 Commits

Reviewing files that changed from the base of the PR and between 6e4d953 and 42ee78e.

📒 Files selected for processing (2)
  • src/nemoclaw.ts
  • test/cli.test.ts
✅ Files skipped from review due to trivial changes (1)
  • test/cli.test.ts

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 51-61: The TOCTOU symlink vulnerability in buildGatewayLogSetup
stems from the check-then-use pattern around /tmp/gateway.log (see
gatewayLogSymlinkGuard and subsequent lines array entries); fix by avoiding the
race: either move the log to a root-owned directory (e.g., under /var/run or
/var/log) and update buildGatewayLogSetup to reference that path, or open/create
the file atomically with a no-follow/open-with-O_NOFOLLOW equivalent (i.e.,
create with flags that fail on symlink) and then operate on the obtained
descriptor before any checks like chmodGatewayLog; ensure all uses in the lines
array perform operations on the safely-created path/descriptor rather than
relying on the initial symlink check.

In `@src/nemoclaw.ts`:
- Around line 328-338: The recovery probe conflates transport timeouts with
execution failure by requiring result.status === 0 when checking markers; update
the logic so that the exec probe (executeSandboxExecCommand) considers stdout
markers ("GATEWAY_PID=" or "ALREADY_RUNNING") sufficient for success even if
status !== 0 (or null due to spawn timeout), while retaining the strict status
=== 0 requirement for the SSH fallback (executeSandboxCommand). Concretely,
adjust or add a helper (e.g., create recoveredExec vs recoveredSSH) to detect
markers from SandboxCommandResult.stdout for execResult without requiring status
=== 0, use that to return true early if markers exist, otherwise if execResult
is non-null treat as failure, and only then call
executeSandboxCommand(sandboxName, script) and validate it with the original
status === 0 + marker check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 882e075f-1d8b-45a9-9c93-3db95431ea2e

📥 Commits

Reviewing files that changed from the base of the PR and between 42ee78e and 6f1c0e1.

📒 Files selected for processing (4)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

Comment thread src/lib/agent-runtime.ts Outdated
Comment thread src/nemoclaw.ts Outdated
@cv cv added v0.0.31 Release target and removed v0.0.30 Release target labels Apr 29, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/lib/agent-runtime.ts (1)

72-83: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden ownerless root setup to close the remaining log-path race

On Line 72 and Lines 81-82, when recovery runs as root with no logOwnerUser, the code chmods the opened fd but does not change ownership. If /tmp/gateway.log was sandbox-owned, the sandbox user can still swap the path before later >> /tmp/gateway.log redirects, reintroducing a TOCTOU window in non-OpenClaw recovery.

Suggested patch
-    "    else:",
-    "        os.fchmod(fd, 0o600)",
+    "    else:",
+    "        if os.geteuid() == 0:",
+    "            os.fchown(fd, 0, 0)",
+    "        os.fchmod(fd, 0o600)",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/agent-runtime.ts` around lines 72 - 83, When recovery runs as root
and no logOwnerUser is set (the branch checking "if owner and os.geteuid() ==
0:" else), the code only fchmods the opened fd which leaves the file owned by a
sandbox user and opens a TOCTOU for later redirection; modify the else branch
for the root/no-owner case to call os.fchown(fd, 0, 0) (set ownership to root)
before os.fchmod(fd, 0o600) so the sandbox user can no longer swap the path;
locate the block that references owner, pw, gr and fd and add the fchown call in
the branch that currently only calls os.fchmod(fd, 0o600).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/agent-runtime.ts`:
- Around line 72-83: When recovery runs as root and no logOwnerUser is set (the
branch checking "if owner and os.geteuid() == 0:" else), the code only fchmods
the opened fd which leaves the file owned by a sandbox user and opens a TOCTOU
for later redirection; modify the else branch for the root/no-owner case to call
os.fchown(fd, 0, 0) (set ownership to root) before os.fchmod(fd, 0o600) so the
sandbox user can no longer swap the path; locate the block that references
owner, pw, gr and fd and add the fchown call in the branch that currently only
calls os.fchmod(fd, 0o600).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 24b8d7c4-0e42-4b07-9eab-cd6154668d4f

📥 Commits

Reviewing files that changed from the base of the PR and between bdda795 and be9e8eb.

📒 Files selected for processing (4)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
  • test/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/cli.test.ts

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 95-100: The sequence that builds shell commands for
"/tmp/auto-pair.log" (the lines.push entries constructing
rm/redirect/chown/chmod) is vulnerable to TOCTOU/symlink attacks; update the
code that emits these commands (the lines.push block for /tmp/auto-pair.log in
src/lib/agent-runtime.ts) to harden the file the same way as "/tmp/gateway.log"
does: either create/open the file with O_NOFOLLOW and use fchmod/fchown on the
file descriptor or move the file into a root-owned directory and operate on it
there, reusing the exact safe pattern/commands used for "/tmp/gateway.log" so
you avoid rm/redirect/chown on a world-writable path.

In `@src/nemoclaw.ts`:
- Around line 4391-4395: The code that computes firstSandboxArg /
implicitConnectHelp / requestedSandboxAction / requestedSandboxActionArgs must
also recognize a leading connect-specific flag `--probe-only` (and
similarly-handled flags at the other block around requestedSandboxAction usage)
so that invocations like `nemoclaw <sandbox> --probe-only` are treated as the
implicit "connect" action rather than falling through to "Unknown action".
Update the logic that sets implicitConnectHelp to true when firstSandboxArg
equals "--help" or "-h" to also treat "--probe-only" (or explicitly reject it
with a connect-specific usage) and ensure requestedSandboxAction is set to
"connect" and requestedSandboxActionArgs includes the original args; repeat the
same change in the analogous block referenced around the other occurrence so
both paths behave consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f508d499-7080-465c-85ae-25e47da81ad7

📥 Commits

Reviewing files that changed from the base of the PR and between be9e8eb and bb77fd1.

📒 Files selected for processing (5)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/lib/command-registry.ts
  • src/nemoclaw.ts
  • test/cli.test.ts
✅ Files skipped from review due to trivial changes (2)
  • src/lib/command-registry.ts
  • test/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/agent-runtime.test.ts

Comment thread src/lib/agent-runtime.ts
Comment thread src/nemoclaw.ts Outdated
@cv cv added v0.0.32 Release target and removed v0.0.31 Release target labels Apr 30, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/cli.test.ts (1)

1547-1554: ⚡ Quick win

Assert that SSH fallback stays non-interactive.

Lines 1549-1553 only prove that the fallback hit sandbox exec, sandbox ssh-config, and an SSH recovery command. Because the openshell stub returns 0 for any other subcommand, this test would still pass if --probe-only accidentally continued into sandbox connect alpha afterward. Add a negative assertion on the recorded OpenShell calls so this actually guards the probe-only contract.

Suggested hardening
     expect(r.code).toBe(0);
     expect(r.out).toContain("Probe complete: recovered OpenClaw gateway");
     const openshellLog = fs.readFileSync(openshellCalls, "utf8");
     const sshLog = fs.readFileSync(sshCalls, "utf8");
     expect(openshellLog).toContain("sandbox exec --name alpha -- sh -c");
     expect(openshellLog).toContain("sandbox ssh-config alpha");
+    expect(openshellLog).not.toContain("sandbox connect alpha");
     expect(sshLog).toContain('OPENCLAW="$(command -v openclaw)"');
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.ts` around lines 1547 - 1554, The test currently only asserts
that probe steps ran but doesn't guard against openshell accidentally being
invoked interactively afterward; update the test (around where openshellLog and
sshLog are read) to also assert that openshellLog does NOT contain the
interactive connection string (e.g. "sandbox connect alpha" or "sandbox
connect") so that the --probe-only flow remains non-interactive—add a negative
expectation against openshellLog (and/or ensure no "connect" invocation was
recorded) alongside the existing positive assertions on "sandbox exec --name
alpha -- sh -c" and "sandbox ssh-config alpha".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 79-87: The current recovery path uses grp.getgrnam(owner) which
fails when the user's primary group has a different name; update the logic
around owner/pw/gr retrieval to use the user's primary GID from
pwd.getpwnam(owner) (pw.pw_gid) instead of attempting grp.getgrnam(owner), then
call os.fchown(fd, pw.pw_uid, pw.pw_gid) and os.fchmod(fd, owner_mode) on
success, falling back to os.fchmod(fd, fallback_mode) only if
pwd.getpwnam(owner) raises KeyError or other lookup errors; adjust the block
surrounding pwd.getpwnam(owner), grp.getgrnam(owner), os.fchown, os.fchmod,
fallback_mode, owner_mode, and fd accordingly.

In `@src/nemoclaw.ts`:
- Around line 340-352: The recovery path can cause UID drift because
non-OpenClaw scripts built by buildRecoveryScript() call
gatewayLaunchCommand(command) with no runAsUser, while
buildOpenClawRecoveryScript() uses gatewayLaunchCommand(command, "gateway");
update the code so either (preferred) AgentDefinition carries a runtime user
hint (e.g., runtimeUser) and have buildRecoveryScript() pass that hint into
gatewayLaunchCommand(command, runtimeUser) so exec launches use the correct
priv-drop, or (alternative) block the exec path for non-OpenClaw agents by
returning null/avoiding executeSandboxExecCommand(sandboxName, script) and
falling back to executeSandboxCommand(sandboxName, script) only, until
AgentDefinition has the runtime user metadata; adjust uses of
buildRecoveryScript, buildOpenClawRecoveryScript, gatewayLaunchCommand,
executeSandboxExecCommand, and executeSandboxCommand accordingly.

---

Nitpick comments:
In `@test/cli.test.ts`:
- Around line 1547-1554: The test currently only asserts that probe steps ran
but doesn't guard against openshell accidentally being invoked interactively
afterward; update the test (around where openshellLog and sshLog are read) to
also assert that openshellLog does NOT contain the interactive connection string
(e.g. "sandbox connect alpha" or "sandbox connect") so that the --probe-only
flow remains non-interactive—add a negative expectation against openshellLog
(and/or ensure no "connect" invocation was recorded) alongside the existing
positive assertions on "sandbox exec --name alpha -- sh -c" and "sandbox
ssh-config alpha".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4bed5fb0-3a5a-45d6-97e9-919a5696d3a8

📥 Commits

Reviewing files that changed from the base of the PR and between d7d7266 and 716ffb4.

📒 Files selected for processing (4)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
  • test/cli.test.ts

Comment thread src/lib/agent-runtime.ts
Comment thread src/nemoclaw.ts Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/nemoclaw.ts (1)

337-360: Run the recovery-focused E2E workflows before merge.

Given the connect/recovery dispatch changes, run the two targeted nightly jobs to validate gateway-restart and process-recovery behavior end-to-end:
gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e

As per coding guidelines: “E2E test recommendation: sandbox-survival-e2e — gateway restart recovery; sandbox-operations-e2e — process recovery after gateway kill.”

Also applies to: 4461-4505

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 337 - 360, Run the two targeted nightly E2E
workflows to validate the connect/recovery changes affecting
recoverSandboxProcesses; specifically trigger the workflows for
sandbox-survival-e2e (gateway restart recovery) and sandbox-operations-e2e
(process recovery after gateway kill) using: gh workflow run nightly-e2e.yaml
--ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e, and verify
that recoverSandboxProcesses (and the calls to executeSandboxCommand and
executeSandboxExecCommand) behave correctly during gateway restart and
process-recovery scenarios; re-run until both jobs pass before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/nemoclaw.ts`:
- Around line 1587-1590: The usage string in printSandboxConnectHelp currently
hardcodes "nemoclaw"; update it to use the existing CLI_NAME constant instead so
help output respects branded/aliased launcher names—replace the literal
"nemoclaw" in the template used inside printSandboxConnectHelp(sandboxName) with
CLI_NAME while keeping sandboxName as the placeholder and preserving the
surrounding formatting and flags (e.g., connect [--probe-only]).

---

Nitpick comments:
In `@src/nemoclaw.ts`:
- Around line 337-360: Run the two targeted nightly E2E workflows to validate
the connect/recovery changes affecting recoverSandboxProcesses; specifically
trigger the workflows for sandbox-survival-e2e (gateway restart recovery) and
sandbox-operations-e2e (process recovery after gateway kill) using: gh workflow
run nightly-e2e.yaml --ref <branch> -f
jobs=sandbox-survival-e2e,sandbox-operations-e2e, and verify that
recoverSandboxProcesses (and the calls to executeSandboxCommand and
executeSandboxExecCommand) behave correctly during gateway restart and
process-recovery scenarios; re-run until both jobs pass before merging.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f60bba63-76ee-4fda-8afe-a4cf84a39721

📥 Commits

Reviewing files that changed from the base of the PR and between d7d7266 and ff86ed7.

📒 Files selected for processing (4)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
  • test/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/cli.test.ts
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts

Comment thread src/nemoclaw.ts
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25183652333
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 2 passed, 18 failed, 1 skipped

Job Result
cloud-e2e ❌ failure
cloud-inference-e2e ❌ failure
cloud-onboard-e2e ❌ failure
deployment-services-e2e ❌ failure
diagnostics-e2e ❌ failure
docs-validation-e2e ❌ failure
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ❌ failure
messaging-providers-e2e ❌ failure
network-policy-e2e ❌ failure
overlayfs-autofix-e2e ❌ failure
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ❌ failure
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ❌ failure
shields-config-e2e ❌ failure
skill-agent-e2e ❌ failure
snapshot-commands-e2e ❌ failure
token-rotation-e2e ❌ failure
upgrade-stale-sandbox-e2e ❌ failure

Failed jobs: cloud-e2e, cloud-onboard-e2e, cloud-inference-e2e, skill-agent-e2e, docs-validation-e2e, messaging-providers-e2e, token-rotation-e2e, sandbox-survival-e2e, sandbox-operations-e2e, inference-routing-e2e, network-policy-e2e, deployment-services-e2e, diagnostics-e2e, snapshot-commands-e2e, shields-config-e2e, rebuild-openclaw-e2e, upgrade-stale-sandbox-e2e, overlayfs-autofix-e2e. Check run artifacts for logs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.test.ts`:
- Around line 132-136: The ordering check uses a generic prepare anchor
("os.open(path, flags, 0o644)") which can match other opens; instead locate the
unique gateway-log preparation by searching for the specific token that
references "/tmp/gateway.log" (e.g. an os.open call or any string literal
containing "/tmp/gateway.log") in the test's script variable, store its index
(replace or add a variable like gatewayPrepIdx) and assert gatewayPrepIdx is >=
0 and gatewayPrepIdx < warnIdx; update the assertions that currently use
prepareIdx to use this gatewayPrepIdx so the test proves the /tmp/gateway.log
prep happens before echo "$_W" >> /tmp/gateway.log.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f32a9009-15e1-4378-8fd4-5a3589a59cc9

📥 Commits

Reviewing files that changed from the base of the PR and between d7d7266 and d42527d.

📒 Files selected for processing (5)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/lib/command-registry.ts
  • src/nemoclaw.ts
  • test/cli.test.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/command-registry.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/cli.test.ts
  • src/lib/agent-runtime.ts

Comment thread src/lib/agent-runtime.test.ts Outdated
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa force-pushed the fix/gateway-recovery-probe-observability branch from 26074f3 to 20d128b Compare April 30, 2026 19:41
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25185326425
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 1 passed, 1 failed, 1 skipped

Job Result
cloud-e2e ⚠️ cancelled
cloud-inference-e2e ⚠️ cancelled
cloud-onboard-e2e ⚠️ cancelled
deployment-services-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ❌ failure
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ⚠️ cancelled
messaging-providers-e2e ⚠️ cancelled
network-policy-e2e ⚠️ cancelled
overlayfs-autofix-e2e ⚠️ cancelled
rebuild-hermes-e2e ⚠️ cancelled
rebuild-openclaw-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled
shields-config-e2e ⚠️ cancelled
skill-agent-e2e ⚠️ cancelled
snapshot-commands-e2e ⚠️ cancelled
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

Failed jobs: docs-validation-e2e. Check run artifacts for logs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/nemoclaw.ts (1)

338-360: Run the two sandbox recovery E2E jobs before merge.

Given the recovery-path changes here, run sandbox-survival-e2e and sandbox-operations-e2e to validate restart and post-kill recovery behavior end-to-end.

As per coding guidelines: "E2E test recommendation: sandbox-survival-e2e — gateway restart recovery; sandbox-operations-e2e — process recovery after gateway kill."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 338 - 360, The reviewer requests running
end-to-end tests to validate the recovery-path changes in
recoverSandboxProcesses; please run the sandbox-survival-e2e and
sandbox-operations-e2e jobs locally or in CI to verify gateway restart and
post-kill process recovery; if failures occur, reproduce with the exact code
paths exercised by agentRuntime.buildRecoveryScript /
buildOpenClawRecoveryScript and the calls to executeSandboxCommand and
executeSandboxExecCommand, capture logs and iterate until both E2E suites pass
before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/cli.test.ts`:
- Around line 1451-1461: Update the "rejects the removed skip-permissions
connect flag" test to stub the external probe path so the code short-circuits
before trying to probe openshell: create the same fake PATH/marker guard used in
the help regression (use writeSandboxRegistry or a tmp bin dir with a dummy
openshell marker) and pass that PATH into runWithEnv when invoking "alpha
connect --dangerously-skip-permissions", so the test asserts the exit code and
message without allowing the probe to run.

---

Nitpick comments:
In `@src/nemoclaw.ts`:
- Around line 338-360: The reviewer requests running end-to-end tests to
validate the recovery-path changes in recoverSandboxProcesses; please run the
sandbox-survival-e2e and sandbox-operations-e2e jobs locally or in CI to verify
gateway restart and post-kill process recovery; if failures occur, reproduce
with the exact code paths exercised by agentRuntime.buildRecoveryScript /
buildOpenClawRecoveryScript and the calls to executeSandboxCommand and
executeSandboxExecCommand, capture logs and iterate until both E2E suites pass
before merging.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5548744f-e53e-46fc-8cef-a5cf57fa0b59

📥 Commits

Reviewing files that changed from the base of the PR and between d42527d and 20d128b.

📒 Files selected for processing (6)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/lib/command-registry.ts
  • src/nemoclaw.ts
  • test/cli.test.ts
  • test/e2e/test-issue-2478-crash-loop-recovery.sh
✅ Files skipped from review due to trivial changes (3)
  • src/lib/command-registry.ts
  • src/lib/agent-runtime.ts
  • src/lib/agent-runtime.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

Comment thread test/cli.test.ts
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25186285024
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 19 passed, 1 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ❌ failure
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: docs-validation-e2e. Check run artifacts for logs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/repro-2681-group-writable.test.ts (1)

53-58: 💤 Low value

Consider using Record<string, string | undefined> instead of NodeJS.ProcessEnv to avoid ESLint warning.

The ESLint no-undef rule flags NodeJS because it's a TypeScript global namespace that ESLint doesn't recognize without TypeScript-aware configuration. While this works at runtime, using the equivalent type avoids the lint warning:

🔧 Suggested fix
 function runScript(
   tmpDir: string,
   lines: string[],
-  env: NodeJS.ProcessEnv = {},
+  env: Record<string, string | undefined> = {},
 ): { result: ReturnType<typeof spawnSync>; log: string } {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/repro-2681-group-writable.test.ts` around lines 53 - 58, The runScript
function's env parameter currently uses the NodeJS.ProcessEnv type which
triggers ESLint no-undef; change its type to Record<string, string | undefined>
(i.e., update the signature of runScript(tmpDir: string, lines: string[], env:
Record<string, string | undefined> = {})) so the linter no longer complains;
ensure any call sites that pass process.env are still accepted (process.env is
compatible with that record type).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/repro-2681-group-writable.test.ts`:
- Around line 53-58: The runScript function's env parameter currently uses the
NodeJS.ProcessEnv type which triggers ESLint no-undef; change its type to
Record<string, string | undefined> (i.e., update the signature of
runScript(tmpDir: string, lines: string[], env: Record<string, string |
undefined> = {})) so the linter no longer complains; ensure any call sites that
pass process.env are still accepted (process.env is compatible with that record
type).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 940e923e-95b2-41d6-a9bb-f7b023720cea

📥 Commits

Reviewing files that changed from the base of the PR and between d1b9145 and c4d0f63.

📒 Files selected for processing (5)
  • nemoclaw/eslint.config.mjs
  • scripts/nemoclaw-start.sh
  • test/fetch-guard-patch-regression.test.ts
  • test/nemoclaw-start.test.ts
  • test/repro-2681-group-writable.test.ts
✅ Files skipped from review due to trivial changes (1)
  • nemoclaw/eslint.config.mjs

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25235923163
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 20 passed, 1 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ❌ failure
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: shields-config-e2e. Check run artifacts for logs.

cv and others added 5 commits May 1, 2026 16:07
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…probe-observability

# Conflicts:
#	nemoclaw/eslint.config.mjs
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…probe-observability

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	test/fetch-guard-patch-regression.test.ts
#	test/nemoclaw-start.test.ts
#	test/repro-2681-group-writable.test.ts
@ericksoa ericksoa requested a review from cv May 2, 2026 00:46
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test/cli.test.ts (1)

1608-1664: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail-closed the “no SSH” assertions by stubbing ssh too.

Line 1608 and Line 1770 tests prove no interactive/probe path via openshell, but they still leave real ssh available on PATH. If code regresses into direct SSH, these tests can hit host binaries instead of failing deterministically. Add a dummy ssh marker and assert it was never called.

Suggested hardening pattern
   const markerFile = path.join(home, "openshell-calls");
+  const sshMarkerFile = path.join(home, "ssh-calls");
   fs.mkdirSync(localBin, { recursive: true });
   fs.writeFileSync(
     path.join(localBin, "openshell"),
@@
     { mode: 0o755 },
   );
+  fs.writeFileSync(
+    path.join(localBin, "ssh"),
+    [
+      "#!/usr/bin/env bash",
+      `printf '%s\\n' "$*" >> ${JSON.stringify(sshMarkerFile)}`,
+      "exit 98",
+    ].join("\n"),
+    { mode: 0o755 },
+  );
@@
   expect(fs.existsSync(markerFile)).toBe(false);
+  expect(fs.existsSync(sshMarkerFile)).toBe(false);

Also applies to: 1770-1815

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.ts` around lines 1608 - 1664, The tests currently stub an
openshell binary but leave the real ssh on PATH; create a dummy ssh in the same
localBin (e.g., write a file path.join(localBin, "ssh") that appends its args to
a new sshMarkerFile and exits non-zero), include that ssh marker alongside the
existing openshell marker (use a variable like sshMarkerFile), and in both tests
add an assertion expect(fs.existsSync(sshMarkerFile)).toBe(false) (or read the
file and expect it to be empty/not created) after running runWithEnv to ensure
no direct SSH calls occurred; update the two affected it blocks ("shows connect
help..." and "rejects the removed skip-permissions...") to create the dummy ssh
and assert it was never invoked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/cli.test.ts`:
- Around line 1608-1664: The tests currently stub an openshell binary but leave
the real ssh on PATH; create a dummy ssh in the same localBin (e.g., write a
file path.join(localBin, "ssh") that appends its args to a new sshMarkerFile and
exits non-zero), include that ssh marker alongside the existing openshell marker
(use a variable like sshMarkerFile), and in both tests add an assertion
expect(fs.existsSync(sshMarkerFile)).toBe(false) (or read the file and expect it
to be empty/not created) after running runWithEnv to ensure no direct SSH calls
occurred; update the two affected it blocks ("shows connect help..." and
"rejects the removed skip-permissions...") to create the dummy ssh and assert it
was never invoked.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eb203d62-f5d8-4f64-ba8a-cc0ecea9d9f9

📥 Commits

Reviewing files that changed from the base of the PR and between 3bbf345 and 7cff948.

📒 Files selected for processing (6)
  • scripts/nemoclaw-start.sh
  • src/nemoclaw.ts
  • test/cli.test.ts
  • test/fetch-guard-patch-regression.test.ts
  • test/nemoclaw-start.test.ts
  • test/repro-2681-group-writable.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • scripts/nemoclaw-start.sh
  • test/fetch-guard-patch-regression.test.ts
  • src/nemoclaw.ts

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25239651165
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 16 passed, 3 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ❌ failure
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ❌ failure
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ✅ success
shields-config-e2e ❌ failure
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: cloud-onboard-e2e, hermes-e2e, shields-config-e2e. Check run artifacts for logs.

…probe-observability

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	test/nemoclaw-start.test.ts
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25240138549
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 0 passed, 1 failed, 1 skipped

Job Result
cloud-e2e ⚠️ cancelled
cloud-inference-e2e ⚠️ cancelled
cloud-onboard-e2e ⚠️ cancelled
deployment-services-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ⚠️ cancelled
gpu-e2e ⏭️ skipped
hermes-e2e ❌ failure
inference-routing-e2e ⚠️ cancelled
messaging-compatible-endpoint-e2e ⚠️ cancelled
messaging-providers-e2e ⚠️ cancelled
network-policy-e2e ⚠️ cancelled
overlayfs-autofix-e2e ⚠️ cancelled
rebuild-hermes-e2e ⚠️ cancelled
rebuild-openclaw-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled
shields-config-e2e ⚠️ cancelled
skill-agent-e2e ⚠️ cancelled
snapshot-commands-e2e ⚠️ cancelled
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

Failed jobs: hermes-e2e. Check run artifacts for logs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib/agent-runtime.ts (1)

111-121: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Unsafe $_GATEWAY_LOG path reopens still allow /tmp path swapping.

The no-follow Python setup only protects the initial create/truncate. Every later : >> "$_GATEWAY_LOG", echo >> "$_GATEWAY_LOG", nohup ... >> "$_GATEWAY_LOG", and the fallback to /tmp/gateway-recovery.log reopens a pathname in /tmp without O_NOFOLLOW. In the OpenClaw path, gateway.log is deliberately handed to the gateway user, so another process with that uid can replace it before the root redirect runs, and the recovery-log fallback is never hardened at all.

Please keep the log under a root-owned directory or keep writing through a trusted pre-opened fd instead of reopening /tmp paths.

Also applies to: 136-142, 205-211

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/agent-runtime.ts` around lines 111 - 121, buildGatewayLogSelection
and gatewayLaunchCommand currently reopen /tmp paths for logging which allows
TOCTOU/filename-swap attacks; instead ensure the daemon writes to a
root-controlled location or to a pre-opened, trusted file descriptor. Modify
buildGatewayLogSelection to create and use a root-owned directory (e.g.,
/var/log/<service>) and create/truncate the log there with safe flags, or open
the logfile once as a trusted FD and use that FD in all subsequent redirects
(nohup and echoes) so no later shell redirection reopens an untrusted path;
update gatewayLaunchCommand (and the similar blocks around the other reported
spots) to reference that secure path or the /proc/self/fd/<n> handle rather than
reopening $_GATEWAY_LOG in /tmp. Ensure ownership/permissions are set so
non-root users cannot swap the file and preserve existing fallback behavior by
hardening the fallback file the same way.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/shields.ts`:
- Around line 281-285: The current sequence in kubectlExec calls
(kubectlExec(sandboxName, ["chmod", dirMode, dirPath]); if (isLocking) {
kubectlExec(sandboxName, ["chmod", "g-s", dirPath]); } kubectlExec(sandboxName,
["chmod", "-R", writeStrip, dirPath]);) risks skipping the mandatory recursive
chmod when the intermediate g-s fails; update the logic so the recursive
write-strip chmod always runs regardless of errors from the optional g-s
step—e.g., perform kubectlExec(... dirMode ...) first, then if (isLocking) call
kubectlExec(... "g-s" ...) inside its own try/catch that swallows/logs errors,
and ensure kubectlExec(... "-R", writeStrip, dirPath) is executed
unconditionally afterwards (or placed in a finally block) so kubectlExec,
sandboxName, dirMode, dirPath, isLocking and writeStrip are used as identifiers
to find and change the code.

---

Duplicate comments:
In `@src/lib/agent-runtime.ts`:
- Around line 111-121: buildGatewayLogSelection and gatewayLaunchCommand
currently reopen /tmp paths for logging which allows TOCTOU/filename-swap
attacks; instead ensure the daemon writes to a root-controlled location or to a
pre-opened, trusted file descriptor. Modify buildGatewayLogSelection to create
and use a root-owned directory (e.g., /var/log/<service>) and create/truncate
the log there with safe flags, or open the logfile once as a trusted FD and use
that FD in all subsequent redirects (nohup and echoes) so no later shell
redirection reopens an untrusted path; update gatewayLaunchCommand (and the
similar blocks around the other reported spots) to reference that secure path or
the /proc/self/fd/<n> handle rather than reopening $_GATEWAY_LOG in /tmp. Ensure
ownership/permissions are set so non-root users cannot swap the file and
preserve existing fallback behavior by hardening the fallback file the same way.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a4d3863-b561-48fd-888b-2c2491b7e9ea

📥 Commits

Reviewing files that changed from the base of the PR and between 1d137c6 and 61bb845.

📒 Files selected for processing (7)
  • src/lib/agent-onboard.test.ts
  • src/lib/agent-onboard.ts
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/lib/shields.ts
  • test/e2e/test-shields-config.sh
  • test/shields.test.ts

Comment thread src/lib/shields.ts
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25240236323
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 0 passed, 1 failed, 1 skipped

Job Result
cloud-e2e ⚠️ cancelled
cloud-inference-e2e ⚠️ cancelled
cloud-onboard-e2e ⚠️ cancelled
deployment-services-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ⚠️ cancelled
gpu-e2e ⏭️ skipped
hermes-e2e ❌ failure
inference-routing-e2e ⚠️ cancelled
messaging-compatible-endpoint-e2e ⚠️ cancelled
messaging-providers-e2e ⚠️ cancelled
network-policy-e2e ⚠️ cancelled
overlayfs-autofix-e2e ⚠️ cancelled
rebuild-hermes-e2e ⚠️ cancelled
rebuild-openclaw-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled
shields-config-e2e ⚠️ cancelled
skill-agent-e2e ⚠️ cancelled
snapshot-commands-e2e ⚠️ cancelled
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

Failed jobs: hermes-e2e. Check run artifacts for logs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/agent-onboard.ts (1)

156-173: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the configured path in not_found failures.

Line 161 preserves binaryPath, but the fallback at Line 173 still reports only path.basename(...). When the configured path is wrong, the error drops the exact value the user needs to fix.

Suggested fix
 function describeAgentBinaryFailure(
   sandboxName: string,
   agent: AgentDefinition,
   result: Exclude<AgentBinaryAvailability, { available: true }>,
 ): string {
   const executable = agentExecutableName(agent);
   if (result.reason === "not_executable") {
     return `${agent.displayName} configured binary '${result.binaryPath}' is not executable inside sandbox '${sandboxName}'`;
   }
-  return `${agent.displayName} binary '${executable}' is missing inside sandbox '${sandboxName}'`;
+  const missingTarget = result.binaryPath ?? executable;
+  return `${agent.displayName} binary '${missingTarget}' is missing inside sandbox '${sandboxName}'`;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/agent-onboard.ts` around lines 156 - 173, The not_found error message
in describeAgentBinaryFailure currently shows only the basename via
agentExecutableName (variable executable) which hides the configured path;
update describeAgentBinaryFailure to prefer and display result.binaryPath when
present for result.reason === "not_found" (falling back to
agentExecutableName/executable if undefined) so the user sees the exact
configured path they provided; reference the function
describeAgentBinaryFailure, the result parameter (AgentBinaryAvailability), and
agentExecutableName to locate and change the message construction.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/lib/agent-onboard.ts`:
- Around line 156-173: The not_found error message in describeAgentBinaryFailure
currently shows only the basename via agentExecutableName (variable executable)
which hides the configured path; update describeAgentBinaryFailure to prefer and
display result.binaryPath when present for result.reason === "not_found"
(falling back to agentExecutableName/executable if undefined) so the user sees
the exact configured path they provided; reference the function
describeAgentBinaryFailure, the result parameter (AgentBinaryAvailability), and
agentExecutableName to locate and change the message construction.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b8c00333-19cc-400b-b2e1-d1ce0bf7d3f6

📥 Commits

Reviewing files that changed from the base of the PR and between b8e27d5 and 69a2656.

📒 Files selected for processing (2)
  • src/lib/agent-onboard.test.ts
  • src/lib/agent-onboard.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/agent-onboard.test.ts

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Selective E2E Results — ❌ Some jobs failed

Run: 25240327545
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 18 passed, 1 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ❌ failure
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: hermes-e2e. Check run artifacts for logs.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25240829338
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 20 passed, 0 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25241862383
Branch: fix/gateway-recovery-probe-observability
Requested jobs: diagnostics-e2e
Summary: 0 passed, 0 failed, 21 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-compatible-endpoint-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ⏭️ skipped
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 2, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25241997799
Branch: fix/gateway-recovery-probe-observability
Requested jobs: all (no filter)
Summary: 20 passed, 0 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@ericksoa ericksoa merged commit 4828ef8 into main May 2, 2026
40 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Sandbox Use this label to identify issues related to the NemoClaw isolated environment based on OpenShell. v0.0.33 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants