fix(onboard): roll back orphaned sandbox when dashboard alloc fails (#2174 followup)#2627
Conversation
…2174) #2411 (merged) addressed the crash half of #2174 — second onboard now auto-allocates a free dashboard port instead of throwing — but the "leaks ghost sandbox" half of the issue title is still live. When findAvailableDashboardPort exhausts the port range, it throws. ensureDashboardForward is called from createSandbox AFTER the openshell sandbox has already been streamed-into-existence (line 3830, after streamSandboxCreate at 3734 and the ready-wait at 3789-3803). The throw escapes upward unhandled, leaving the openshell sandbox in 'openshell sandbox list' with no NemoClaw registry entry — exactly the drift the original bug reported. Add an opt-in 'rollbackSandboxOnFailure' option to ensureDashboardForward that catches the throw, runs 'openshell sandbox delete <name>', emits an actionable error, and process.exit(1) — matching the existing not-ready rollback pattern at lines 3789-3803. Only the post-create call site at line 3830 opts in. The four reuse-path call sites operate on already-existing sandboxes where deleting on alloc failure would destroy user state; they keep the default (false). Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthrough
Changes
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Extract the rollback error-line construction into a pure helper buildOrphanedSandboxRollbackMessage so the rollback path is testable without spawning subprocesses or exiting the process. Add test/onboard-rollback.test.ts with 4 cases: - successful delete → "removed — you can safely retry" line - failed delete → manual-cleanup guidance with the exact openshell command for the orphaned sandbox name - non-Error throwables coerced via String() so the alloc message survives - arbitrary sandbox names round-trip into the cleanup command verbatim ensureDashboardForward still drives runOpenshell + console.error + process.exit (untestable from outside the module without mocking subprocess); the helper isolates the deterministic part where the behavior actually lives. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
6185-6202: Align the JSDoc blocks with the actual function targets.The rollback behavior docs for
ensureDashboardForwardare currently separated by another JSDoc block; attach each block directly to its function to avoid doc drift.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6185 - 6202, The JSDoc paragraphs are misaligned: the rollback behavior doc meant for ensureDashboardForward is separated by another comment block and thus not attached to its function; move the JSDoc that documents rollbackSandboxOnFailure/rollback behavior so it sits immediately above the ensureDashboardForward function declaration, and likewise place the doc for "Build the actionable error lines" directly above the pure function that builds the error lines (the function that takes sandboxName, alloc-error, delete-result), ensuring each JSDoc block directly precedes its corresponding function (ensureDashboardForward and the error-line builder) to prevent doc drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/onboard-rollback.test.ts`:
- Line 21: Replace the CommonJS require call that assigns loadedOnboardInternals
with a top-level ESM import; remove the require("../dist/lib/onboard") usage and
add an import (e.g., import * as loadedOnboardInternals from
"../dist/lib/onboard") at the top of test/onboard-rollback.test.ts so the test
file complies with the project ESM requirements and all usages of
loadedOnboardInternals continue to work.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 6185-6202: The JSDoc paragraphs are misaligned: the rollback
behavior doc meant for ensureDashboardForward is separated by another comment
block and thus not attached to its function; move the JSDoc that documents
rollbackSandboxOnFailure/rollback behavior so it sits immediately above the
ensureDashboardForward function declaration, and likewise place the doc for
"Build the actionable error lines" directly above the pure function that builds
the error lines (the function that takes sandboxName, alloc-error,
delete-result), ensuring each JSDoc block directly precedes its corresponding
function (ensureDashboardForward and the error-line builder) to prevent doc
drift.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2f910e7e-fc67-448d-96b6-2fa3c82dd0cf
📒 Files selected for processing (2)
src/lib/onboard.tstest/onboard-rollback.test.ts
| ); | ||
| } | ||
|
|
||
| const loadedOnboardInternals = require("../dist/lib/onboard"); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify test files under test/ are ESM-only (no CommonJS require/module.exports).
rg -n --glob 'test/**/*.{js,ts}' '\brequire\s*\(|\bmodule\.exports\b|\bexports\.'Repository: NVIDIA/NemoClaw
Length of output: 33865
Replace CommonJS require with ESM import in this test.
Line 21 uses require(), which breaks the test-folder ESM requirement per coding guidelines.
Proposed fix
import { describe, it, expect } from "vitest";
+import * as loadedOnboardInternals from "../dist/lib/onboard";
@@
-const loadedOnboardInternals = require("../dist/lib/onboard");
const onboardInternals =
typeof loadedOnboardInternals === "object" && loadedOnboardInternals !== null
? loadedOnboardInternals
: null;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/onboard-rollback.test.ts` at line 21, Replace the CommonJS require call
that assigns loadedOnboardInternals with a top-level ESM import; remove the
require("../dist/lib/onboard") usage and add an import (e.g., import * as
loadedOnboardInternals from "../dist/lib/onboard") at the top of
test/onboard-rollback.test.ts so the test file complies with the project ESM
requirements and all usages of loadedOnboardInternals continue to work.
There was a problem hiding this comment.
Pushing back on this. Tried the conversion in commit ef9b0d5 then reverted.
Adding export to buildOrphanedSandboxRollbackMessage in src/lib/onboard.ts (so the named ESM import would type-check) reshapes dist/lib/onboard.d.ts from unknown-shaped to a typed module. That breaks two other tests:
test/ssh-known-hosts.test.ts:27— TS2345test/wsl2-probe-timeout.test.ts:26,29— TS2559 + TS2345
Both rely on the d.ts staying opaque so their runtime guards (isOnboardKnownHostsInternals, isOnboardValidationCandidate) type-narrow against OnboardX | typeof import("../dist/lib/onboard") | null. Once import("../dist/lib/onboard") has known properties, TS rejects "no properties in common" with the OnboardXCandidate shape and the call sites stop compiling.
Keeping require() here matches the convention already used for this same file by:
test/credential-rotation.test.ts:43test/gateway-start-wait.test.ts:15test/gemini-probe-auth.test.ts:14test/ssh-known-hosts.test.tstest/wsl2-probe-timeout.test.ts
The mixed pattern in test/ reflects whether the imported module uses per-function export (pure-function modules like policies, runner, resolve-openshell) or bottom-of-file module.exports = {...} with runtime-guarded internals (onboard). The CLAUDE.md ESM rule applies cleanly to the former; the latter would need a coordinated sweep of onboard.ts + the four neighbouring tests, which is out of scope for #2174's leak fix.
Documented this rationale inline in test/onboard-rollback.test.ts so the next reader doesn't re-litigate.
The other CR finding (JSDoc alignment) is fixed in the same commit.
There was a problem hiding this comment.
@cjagwani, that's a thorough and convincing explanation — withdrawing this comment. The cascading TS2345/TS2559 failures in ssh-known-hosts and wsl2-probe-timeout make clear that keeping dist/lib/onboard.d.ts opaque is load-bearing for the existing narrowing guards, and the coordinated sweep to fix all four sibling tests is correctly scoped out of this PR. Appreciate you documenting it inline too.
✏️ Learnings added
Learnt from: cjagwani
Repo: NVIDIA/NemoClaw PR: 2627
File: test/onboard-rollback.test.ts:27-27
Timestamp: 2026-04-28T16:54:08.508Z
Learning: In the NVIDIA/NemoClaw test suite, `require()` (CommonJS) is intentionally used—rather than an ESM `import * as`—when importing from `../dist/lib/onboard` in test files (e.g., `test/onboard-rollback.test.ts`, `test/credential-rotation.test.ts`, `test/gateway-start-wait.test.ts`, `test/gemini-probe-auth.test.ts`, `test/ssh-known-hosts.test.ts`, `test/wsl2-probe-timeout.test.ts`). The reason is that `src/lib/onboard.ts` uses a bottom-of-file `module.exports = { ... }` pattern, which keeps `dist/lib/onboard.d.ts` opaque (unknown-shaped). If a named `export` is added to `onboard.ts` (as required for a typed ESM import), TS gives the module known properties and breaks the "no properties in common" narrowing guards (`isOnboardKnownHostsInternals`, `isOnboardValidationCandidate`, etc.) in sibling tests with TS2345/TS2559 errors. The `require()` convention for this module is deliberate and should not be replaced with ESM imports without a coordinated sweep of `onboard.ts` and all four neighbouring test files.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: laitingsheng
Repo: NVIDIA/NemoClaw PR: 2511
File: test/onboard-brave-validation.test.ts:45-48
Timestamp: 2026-04-27T09:53:20.358Z
Learning: In this repo’s test suite, prefer the established POSIX PATH separator `:` when constructing `process.env.PATH` in tests (e.g., `PATH: `${fakeBin}:${process.env.PATH || ""}``). Do not replace it with `path.delimiter` in these unit/integration tests, because they only run on Linux runners in CI here; `windows-latest` is limited to WSL e2e tests in `.github/workflows/wsl-e2e.yaml` and is already POSIX-compliant.
Address CodeRabbit feedback on #2627: - src/lib/onboard.ts: move the ensureDashboardForward JSDoc (with the rollbackSandboxOnFailure docs) so it sits directly above the function. Inserting buildOrphanedSandboxRollbackMessage between the JSDoc and its function had stranded the doc above the wrong function. - test/onboard-rollback.test.ts: add a comment explaining why the test uses require("../dist/lib/onboard") instead of named ESM import. CR suggested converting to ESM; tried it. Adding `export` to the helper in source breaks the d.ts shape that ssh-known-hosts.test.ts and wsl2-probe-timeout.test.ts depend on for runtime-guard type narrowing (TS2345/TS2559). The require() pattern is already the convention for this file (matches credential-rotation, gemini-probe-auth, ssh-known-hosts, wsl2-probe-timeout). Adopting CR's suggestion would require a sweep across those neighbours, which is out of scope here. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
) ## Summary Adds two regression assertions to Phase 4 of `test-double-onboard.sh` that verify the #2174 auto-allocation fix works end-to-end: 1. **Log check** — Second-sandbox onboard output contains `"is taken. Using port"` (confirms `ensureDashboardForward` auto-allocated) 2. **List check** — `nemoclaw list` shows two distinct `dashboard: http://127.0.0.1:<port>` entries for both sandboxes ## Related - Guards against regression of #2174 (fix landed via #2411 + #2627) - Supersedes #2455 (which had a stale assertion string `"allocated port"` that doesn't match any runtime output, and conflicting workflow changes already on main) ## Changes - `test/e2e/test-double-onboard.sh` — 21 lines added between the existing #849 regression check and Phase 5 ## Why this supersedes #2455 | Issue | #2455 | This PR | |-------|-------|---------| | Assertion string | `"allocated port"` (doesn't exist in codebase) | `"is taken. Using port"` (matches `ensureDashboardForward` line ~6600) | | Workflow changes | Adds a `double-onboard-e2e` job that already exists on main → merge conflict | None needed — job already runs this script | | Branch freshness | 95 commits behind main | Based on current main | ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `bash -n test/e2e/test-double-onboard.sh` — syntax check passes - [x] shellcheck clean - [x] No workflow changes needed — `double-onboard-e2e` nightly job already runs this script - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added end-to-end test assertions for multi-sandbox onboarding scenarios * Validates dashboard port auto-allocation prevents conflicts and maintains isolation across concurrent sandbox environments <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
…VIDIA#2174 followup) (NVIDIA#2627) ## Summary Followup to NVIDIA#2411 (merged). Closes the second half of NVIDIA#2174's title — "leaks ghost sandbox" — that the merged fix didn't address. When `findAvailableDashboardPort` exhausts the port range it `throw`s. By the time `ensureDashboardForward` runs (`src/lib/onboard.ts:3830`), the openshell sandbox has already been streamed into existence (`streamSandboxCreate` at line 3734, ready-wait at lines 3789–3803). The throw escapes unhandled, leaving the sandbox in `openshell sandbox list` with no NemoClaw registry entry — exactly the drift reported in NVIDIA#2174. Issue comment with the analysis: NVIDIA#2174 (comment) ## Related Issue Followup to NVIDIA#2174 (closed by NVIDIA#2411). NVIDIA#2411 fixed the crash; this PR fixes the leak. ## Changes - `src/lib/onboard.ts` — `ensureDashboardForward` takes a new `{ rollbackSandboxOnFailure?: boolean }` option. On unrecoverable port-allocation failure (range exhaustion), the just-created openshell sandbox is deleted, an actionable error is printed, and the process exits with code 1 — mirroring the not-ready rollback pattern already in place at lines 3789–3803. - The post-create call site (line 3830) opts in (`rollbackSandboxOnFailure: true`). - The four reuse-path call sites (lines 3295, 3326, 3372, 3390) keep the default (`false`) — they operate on already-existing sandboxes where deleting on alloc failure would destroy user state. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npm run build:cli` + `npm run typecheck:cli` green - [x] `npx vitest run --project cli` — 2250/2250 passed - [x] `npx vitest run --project plugin` — 410/410 passed - [x] `npx prek run --files src/lib/onboard.ts` — green - [x] No new tests added: the rollback path mirrors the existing not-ready rollback (3789–3803), which also has no dedicated unit test. Adding test infrastructure here would exceed the minimum scope of this followup. - [x] No secrets, API keys, or credentials committed ## Notes for reviewer - Diff is ~35 lines on a single file. Surgical fix matching an existing pattern in the same function. - Doc comment on `ensureDashboardForward` updated to describe the new option and reference the pattern it mirrors. ## AI Disclosure - [x] AI-assisted — tool: Claude Code Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Sandbox initialization now attempts automatic rollback if dashboard port allocation fails, and exits safely when rollback succeeds. * When automatic cleanup fails, users are shown clear manual-cleanup instructions and the exact command to remove the orphaned sandbox. * **Tests** * Added runtime tests verifying rollback messaging covers both successful and failed cleanup scenarios and accurate sandbox naming in instructions. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
…rd (NVIDIA#2709) ## Summary Adds two regression assertions to Phase 4 of `test-double-onboard.sh` that verify the NVIDIA#2174 auto-allocation fix works end-to-end: 1. **Log check** — Second-sandbox onboard output contains `"is taken. Using port"` (confirms `ensureDashboardForward` auto-allocated) 2. **List check** — `nemoclaw list` shows two distinct `dashboard: http://127.0.0.1:<port>` entries for both sandboxes ## Related - Guards against regression of NVIDIA#2174 (fix landed via NVIDIA#2411 + NVIDIA#2627) - Supersedes NVIDIA#2455 (which had a stale assertion string `"allocated port"` that doesn't match any runtime output, and conflicting workflow changes already on main) ## Changes - `test/e2e/test-double-onboard.sh` — 21 lines added between the existing NVIDIA#849 regression check and Phase 5 ## Why this supersedes NVIDIA#2455 | Issue | NVIDIA#2455 | This PR | |-------|-------|---------| | Assertion string | `"allocated port"` (doesn't exist in codebase) | `"is taken. Using port"` (matches `ensureDashboardForward` line ~6600) | | Workflow changes | Adds a `double-onboard-e2e` job that already exists on main → merge conflict | None needed — job already runs this script | | Branch freshness | 95 commits behind main | Based on current main | ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `bash -n test/e2e/test-double-onboard.sh` — syntax check passes - [x] shellcheck clean - [x] No workflow changes needed — `double-onboard-e2e` nightly job already runs this script - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added end-to-end test assertions for multi-sandbox onboarding scenarios * Validates dashboard port auto-allocation prevents conflicts and maintains isolation across concurrent sandbox environments <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Summary
Followup to #2411 (merged). Closes the second half of #2174's title — "leaks ghost sandbox" — that the merged fix didn't address.
When
findAvailableDashboardPortexhausts the port range itthrows. By the timeensureDashboardForwardruns (src/lib/onboard.ts:3830), the openshell sandbox has already been streamed into existence (streamSandboxCreateat line 3734, ready-wait at lines 3789–3803). The throw escapes unhandled, leaving the sandbox inopenshell sandbox listwith no NemoClaw registry entry — exactly the drift reported in #2174.Issue comment with the analysis: #2174 (comment)
Related Issue
Followup to #2174 (closed by #2411). #2411 fixed the crash; this PR fixes the leak.
Changes
src/lib/onboard.ts—ensureDashboardForwardtakes a new{ rollbackSandboxOnFailure?: boolean }option. On unrecoverable port-allocation failure (range exhaustion), the just-created openshell sandbox is deleted, an actionable error is printed, and the process exits with code 1 — mirroring the not-ready rollback pattern already in place at lines 3789–3803.rollbackSandboxOnFailure: true).false) — they operate on already-existing sandboxes where deleting on alloc failure would destroy user state.Type of Change
Verification
npm run build:cli+npm run typecheck:cligreennpx vitest run --project cli— 2250/2250 passednpx vitest run --project plugin— 410/410 passednpx prek run --files src/lib/onboard.ts— greenNotes for reviewer
ensureDashboardForwardupdated to describe the new option and reference the pattern it mirrors.AI Disclosure
Signed-off-by: Charan Jagwani cjagwani@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests