fix(onboard): verify Docker-driver gateway is really serving before reporting healthy (#3111)#3378
Conversation
Auto-generated by pr-e2e-loop Phase 4.5. See PR #3362 for the coverage guard (failing E2E that is the acceptance gate for this fix).
…eporting healthy Fixes #3111. ## Problem startDockerDriverGateway() spawned the openshell-gateway binary with `detached: true` + child.unref(), then polled using two checks that could both lie: 1. **isPidAlive(childPid)** uses `process.kill(pid, 0)` which returns true for zombies. Since the parent Node process never wait()s on the detached child, crashed children linger as zombies indefinitely and isPidAlive reports them as alive. 2. **isGatewayHealthy(status, gwInfo, activeGwInfo)** is a pure string match over openshell CLI output. isGatewayConnected() matches on "Server Status" — the table *header* that `openshell status` always prints, regardless of whether the gateway endpoint is reachable. On a crashed gateway, the header is still emitted and the body contains `Connection refused` — but isGatewayConnected returns true anyway because it only looks for the header string. Result: onboard logged `✓ Docker-driver gateway is healthy` immediately after spawning a crashed binary, then proceeded to the next onboard step which failed with `Connection refused (os error 111)`. The reported trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in the shipped openshell-gateway binary, but the class of bug is platform-independent: any reason the binary fails to start (missing shared lib, permissions, CDI spec error, port conflict) surfaces the same false-positive. ## Fix Three coordinated changes in `startDockerDriverGateway`: 1. **Track child-exit via the ChildProcess 'exit' event**, not just isPidAlive. The exit listener flips a `childExited` flag that the poll loop consults alongside isPidAlive. This catches zombies that isPidAlive misses. 2. **Add verifyDockerDriverGatewayListening(port, timeoutMs)** — a TCP connect probe to 127.0.0.1:${GATEWAY_PORT} using node:net with an AbortSignal-equivalent timeout. Resolves boolean, never throws. This is the DockerDriver path equivalent of verifyGatewayContainerRunning (added for #2020 on the K3s path). 3. **Gate the 'healthy' log on the TCP probe**: the poll loop now only logs "✓ Docker-driver gateway is healthy" after isGatewayHealthy AND a successful TCP connect. On probe failure, keep polling — the binary may be slow to bind its listener. The childExited check at the top of the loop terminates us if the process actually died. Also improves the final failure message to include how the gateway exited (signal vs. exit code) so users don't have to dig into the gateway log. ## isGatewayHealthy intentionally not modified The #2020 follow-up test in test/gateway-liveness-probe.test.ts pins isGatewayHealthy to pure-function status (no I/O). Fix at the caller, not the shared helper — same pattern as #2020 used for the K3s path. ## Tests - New unit test: test/gateway-tcp-probe.test.ts (9 tests) - verifyDockerDriverGatewayListening resolves true for listening ports - resolves false for closed ports (Connection refused) - resolves false on timeout (non-routable host) - enforces minimum 50ms timeout - never throws - source-shape guards: child-exit tracking, childExited check in poll loop, TCP probe before "healthy" log, exit details in failure msg - E2E acceptance gate: gateway-health-honest-e2e (from PR #3362) must flip from red on main to green on this branch. ## Refactoring alignment - TODO(#2562) added for the timeout abstraction - TODO(#3213) added for structured advisory output - Coordinates with open PRs #3306 (gateway-bootstrap refactor) and #3312 (isGatewayHttpReady for K3s path) — same pattern, different surface. See ACCEPTANCE.md. Related: #3111 Coverage guard: #3362
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryPi Semantic E2E AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Selective E2E Results — ✅ All requested jobs passedRun: 25706219740
|
…x md lint Address PR review findings: 1. Merge origin/main to pick up PR #3312 (isGatewayHttpReady from src/lib/onboard/gateway-http-readiness.ts). Replace our ad-hoc verifyDockerDriverGatewayListening (TCP probe) with a call to the shared isGatewayHttpReady helper — HTTP is strictly stronger than a raw TCP connect and is already the de-facto standard used by every other gateway-reuse decision site in onboard.ts after #3312. Docker-driver and K3s paths now converge on the same probe. 2. Drop the parallel test/gateway-tcp-probe.test.ts (9 tests for a helper that no longer exists). The shared helper's behavior is already covered by test/gateway-http-reuse-wait.test.ts (21 tests). Replace with test/gateway-health-honest-integration.test.ts (6 source-shape tests) guarding the #3111 integration pattern. 3. Fix 7 markdownlint errors in ACCEPTANCE.md (MD040, MD022, MD031) — the 'checks' job flagged these; now passes locally. 4. Update ACCEPTANCE.md to reflect the simpler design: Phase 1 is 'reuse existing helper' rather than 'add new TCP probe helper', and the refactoring alignment table records #3312 as 'adopted' rather than 'to be coordinated'. Behavior of the fix is unchanged: - child.once('exit') listener tracks zombies (detached children that process.kill(pid, 0) would falsely report as alive) - poll loop guard is childExited || !isPidAlive(childPid) - healthy log gated on isGatewayHttpReady() after isGatewayHealthy() - user-visible exit code/signal surfaced on failure Related: #3111 Coverage guard: #3362 — gateway-health-honest-e2e
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRecords gateway child exit state and requires a successful TCP connect probe before marking the Docker-driver gateway healthy; updates startup loop gating and failure diagnostics; adds a TCP readiness helper with unit tests and a source-shape integration test that asserts the new wiring and guards. ChangesGateway Health Check Reliability
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Selective E2E Results — ✅ All requested jobs passedRun: 25742970106
|
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
5420-5498:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winReplace blocking
sleep()with an async delay to allow the event loop to process the child exit event.The
child.once("exit", ...)handler only fires when the event loop processes the SIGCHLD signal. In the polling loop, when the gateway crashes beforeisGatewayHealthy()reaches the awaited HTTP probe, bothsleep()calls (lines 5452 and 5479) block the entire event loop. This preventschildExitedfrom being set by the handler, keeping it false for the entire poll window. The loop then relies onisPidAlive()alone, which treats zombies as alive, defeating the early-exit logic and preventing the new exit-code and exit-signal diagnostics from populating.Replace each
sleep(pollInterval)with an awaitable async delay to unblock the event loop between polling iterations:Suggested fix
const pollCount = envInt("NEMOCLAW_HEALTH_POLL_COUNT", 30); const pollInterval = envInt("NEMOCLAW_HEALTH_POLL_INTERVAL", 2); + const waitForNextPoll = (seconds: number) => + new Promise<void>((resolve) => setTimeout(resolve, seconds * 1000)); for (let i = 0; i < pollCount; i += 1) { // `#3111`: combine the exit-event check with isPidAlive. The exit event // catches zombies that isPidAlive misses; isPidAlive catches children // we somehow lost track of (defense in depth). if (childExited || !isPidAlive(childPid)) { break; } if (!registerDockerDriverGatewayEndpoint()) { - if (i < pollCount - 1) sleep(pollInterval); + if (i < pollCount - 1) await waitForNextPoll(pollInterval); continue; } const status = runCaptureOpenshell(["status"], { ignoreError: true }); const namedInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], { ignoreError: true, }); const currentInfo = runCaptureOpenshell(["gateway", "info"], { ignoreError: true }); if (isGatewayHealthy(status, namedInfo, currentInfo)) { if (await isGatewayHttpReady()) { console.log(" ✓ Docker-driver gateway is healthy"); return; } } - if (i < pollCount - 1) sleep(pollInterval); + if (i < pollCount - 1) await waitForNextPoll(pollInterval); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 5420 - 5498, The polling loop uses a blocking sleep(pollInterval) which prevents the child.once("exit", ...) handler from running; replace both blocking calls to sleep(pollInterval) inside the for loop with an awaitable non-blocking delay (e.g. an async delay that wraps setTimeout) and await it (await delay(pollInterval * <appropriate unit>)) so the event loop can process SIGCHLD and set childExited/childExitCode; ensure the enclosing function is async (or already is) so you can await the delay, and update the two instances where sleep(pollInterval) is called in the loop (the branch after registerDockerDriverGatewayEndpoint() and the end-of-loop backoff) to use await delay(...) instead.
🧹 Nitpick comments (2)
test/gateway-health-honest-integration.test.ts (2)
91-93: ⚡ Quick winRemove the arbitrary 200-char window in the shared-import assertion.
{0,200}can fail on harmless formatting/reordering. Matching the destructuring+require shape directly is more stable and still enforces the same contract.Proposed regex stabilization
- expect(content).toMatch( - /isGatewayHttpReady,[\s\S]{0,200}?require\(\s*["']\.\/onboard\/gateway-http-readiness["']/, - ); + expect(content).toMatch( + /\{[\s\S]*\bisGatewayHttpReady\b[\s\S]*\}\s*=\s*require\(\s*["']\.\/onboard\/gateway-http-readiness["']\s*\)/, + );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/gateway-health-honest-integration.test.ts` around lines 91 - 93, The test's regex uses an arbitrary {0,200} window which can break on harmless formatting changes; update the assertion in the test that inspects content (the expect(content).toMatch call) to directly match the destructuring and require shape for isGatewayHttpReady and ./onboard/gateway-http-readiness without the {0,200} range — i.e., remove the {0,200} and use a stable pattern that allows optional whitespace/newlines between "isGatewayHttpReady," and require("./onboard/gateway-http-readiness") so the assertion targets the exact destructuring+require pairing.
58-62: ⚡ Quick winTighten the healthy-log gating assertion to reduce false positives.
The current check only verifies that some
await isGatewayHttpReady(appears before the log string, not that they are coupled in the same decision path.Proposed assertion hardening
- const healthyIdx = fnBody.indexOf("Docker-driver gateway is healthy"); - expect(healthyIdx).toBeGreaterThan(0); - const before = fnBody.slice(0, healthyIdx); - expect(before).toMatch(/await\s+isGatewayHttpReady\(/); + expect(fnBody).toMatch( + /if\s*\([\s\S]*await\s+isGatewayHttpReady\([^)]*\)[\s\S]*\)\s*\{[\s\S]{0,300}Docker-driver gateway is healthy/, + );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/gateway-health-honest-integration.test.ts` around lines 58 - 62, The current test only verifies that some "await isGatewayHttpReady(" appears before the "Docker-driver gateway is healthy" log, which can yield false positives; change the assertion to find the last occurrence of "await isGatewayHttpReady(" before healthyIdx (use fnBody.lastIndexOf("await isGatewayHttpReady(", healthyIdx)) and assert that the substring from that index to healthyIdx includes a decision construct tying them together (e.g., matches /if\s*\(.*await\s+isGatewayHttpReady\(|if\s*\(\s*await\s+isGatewayHttpReady/ or that the call is immediately used in an if/ternary), ensuring the awaited call and the "Docker-driver gateway is healthy" log are in the same control-flow branch rather than merely earlier in the file; update references to healthyIdx, fnBody, and isGatewayHttpReady accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ACCEPTANCE.md`:
- Line 12: The ACCEPTANCE.md contains emoji status markers (e.g., 🔴, 🟢, 🟡, ✨)
in technical prose such as the nightly job name gateway-health-honest-e2e and
several other sections; remove all emoji and replace them with plain text status
labels (e.g., "red"/"green"/"yellow"/"beta" or more descriptive words like
"failing"/"passing") consistently across the referenced areas (the job line and
the other occurrences called out in the comment), updating any inline
descriptions so they read as standard text without emojis and preserving the
original meaning and formatting.
- Line 1: Add the required SPDX license header as the very first line of
ACCEPTANCE.md (before the existing "# Issue `#3111` — Development Plan +
Acceptance Criterion" title); insert a single-line SPDX-License-Identifier with
the project's chosen identifier (for example MIT or Apache-2.0) so the file
complies with the `**/*.{js,ts,tsx,sh,md}` header rule and passes linting.
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 5420-5498: The polling loop uses a blocking sleep(pollInterval)
which prevents the child.once("exit", ...) handler from running; replace both
blocking calls to sleep(pollInterval) inside the for loop with an awaitable
non-blocking delay (e.g. an async delay that wraps setTimeout) and await it
(await delay(pollInterval * <appropriate unit>)) so the event loop can process
SIGCHLD and set childExited/childExitCode; ensure the enclosing function is
async (or already is) so you can await the delay, and update the two instances
where sleep(pollInterval) is called in the loop (the branch after
registerDockerDriverGatewayEndpoint() and the end-of-loop backoff) to use await
delay(...) instead.
---
Nitpick comments:
In `@test/gateway-health-honest-integration.test.ts`:
- Around line 91-93: The test's regex uses an arbitrary {0,200} window which can
break on harmless formatting changes; update the assertion in the test that
inspects content (the expect(content).toMatch call) to directly match the
destructuring and require shape for isGatewayHttpReady and
./onboard/gateway-http-readiness without the {0,200} range — i.e., remove the
{0,200} and use a stable pattern that allows optional whitespace/newlines
between "isGatewayHttpReady," and require("./onboard/gateway-http-readiness") so
the assertion targets the exact destructuring+require pairing.
- Around line 58-62: The current test only verifies that some "await
isGatewayHttpReady(" appears before the "Docker-driver gateway is healthy" log,
which can yield false positives; change the assertion to find the last
occurrence of "await isGatewayHttpReady(" before healthyIdx (use
fnBody.lastIndexOf("await isGatewayHttpReady(", healthyIdx)) and assert that the
substring from that index to healthyIdx includes a decision construct tying them
together (e.g., matches
/if\s*\(.*await\s+isGatewayHttpReady\(|if\s*\(\s*await\s+isGatewayHttpReady/ or
that the call is immediately used in an if/ternary), ensuring the awaited call
and the "Docker-driver gateway is healthy" log are in the same control-flow
branch rather than merely earlier in the file; update references to healthyIdx,
fnBody, and isGatewayHttpReady accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 70d3075d-2157-4107-96aa-f203a903950a
📒 Files selected for processing (3)
ACCEPTANCE.mdsrc/lib/onboard.tstest/gateway-health-honest-integration.test.ts
- Add SPDX license header (HTML-comment style, matching tools/e2e-advisor/README.md) - Replace emoji status markers (🔴/🟢/🟡/✨) with plain-text labels ([E2E], [OK], [Unit]) per the coding guideline 'No emoji in technical prose'. Literal '✓' characters in log-output examples are retained because they are not decorative — they are the actual bytes the onboard code prints.
There was a problem hiding this comment.
Good call — removed in e6c08a7. ACCEPTANCE.md was a pr-e2e-loop workflow artifact (the skill writes it as a handoff from the coverage-guard PR to the fix PR). The relevant acceptance content — "gateway-health-honest-e2e must flip from red on main to green on this branch" — is already captured in the PR description and the fix commit message, so the file wasn't adding anything durable. I'll update the skill to either not emit this file into the repo or write it under a gitignored path.
Selective E2E Results — ❌ Some jobs failedRun: 25747662365
|
|
Run: https://github.com/NVIDIA/NemoClaw/actions/runs/25747662365 Results:
Please investigate/fix the failing E2E path and rerun the targeted validation. This PR should not proceed until the advisor-required E2E set is passing. To have an agent rerun this after fixes, use a prompt like:
Note: I attempted to submit this as a formal “request changes” review, but GitHub does not allow requesting changes on your own PR from this account, so posting as a blocking comment instead. |
|
✨ Related open issues: |
…iness Addresses the failing openshell-gateway-upgrade-e2e surfaced by the e2e-advisor on PR #3378. ## Problem The previous commit converged the Docker-driver gateway's readiness probe onto isGatewayHttpReady from ./onboard/gateway-http-readiness (introduced by #3312). That broke openshell-gateway-upgrade-e2e: Starting OpenShell gateway... [PASS] Stale gateway is healthy with supervisor:0.0.36 [INFO] Invoking NemoClaw gateway start path; it must restart the stale process Docker-driver gateway failed to start. <-- 60s timeout The gateway log showed the process was actually serving correctly: POST /openshell.v1.OpenShell/Health response status=200 latency_ms=0 POST /openshell.v1.OpenShell/Health response status=200 latency_ms=0 GET / response status=404 latency_ms=0 ## Root cause isGatewayHttpReady does GET /, and only accepts {200, 401}. The K3s-path gateway answers GET / via a dispatcher catch-all. The Docker-driver gateway, by contrast, only serves routes under /openshell.v1.OpenShell/* and returns 404 for GET /. So the probe always fails on the Docker- driver path regardless of whether the gateway is up. ## Fix Revert to a plain TCP probe on the gateway port for the Docker-driver path. Add a small isDockerDriverGatewayTcpReady(port, timeoutMs, host) helper next to isPidAlive. This is semantic-free — it just asks "is anyone listening" — which is exactly what's needed to detect the #3111 failure mode (crashed binary → nothing listening) without making assumptions about HTTP route shape. isGatewayHttpReady is left in place for all K3s-path call sites; the two paths use different probes because the two gateway types expose different HTTP surfaces. TODO(#3213) filed for future convergence. ## Tests test/gateway-health-honest-integration.test.ts now has 10 tests: 5 behavioural: listening / closed / timeout / min-timeout / no-throw 5 source-shape: child-exit tracking, poll-loop guard, TCP probe call site, regression guard against reintroducing the HTTP probe in the Docker-driver loop, exit-details surfaced ## Expected CI result - gateway-health-honest-e2e: GREEN (already was — this was never affected) - openshell-gateway-upgrade-e2e: FLIPS BACK to GREEN (the breakage this commit addresses) Related: #3111 Broke in: 29bba3e (refactor(onboard): converge on shared isGatewayHttpReady)
Selective E2E Results — ✅ All requested jobs passedRun: 25752904986
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
5466-5492:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle
spawn()failures before the PID check to preserve exec error details.If
openshell-gatewaycannot exec, Node reports this through the child"error"event, but the current code only listens for"exit". When exec fails,pidremains unset and we throw a generic "did not return a pid" error, losing the real failure reason. ThestartModelRouter()function in this file already handles both"error"and"exit"events and includes the error detail in the PID-check message. Apply the same pattern here.Suggested fix
const child = spawn(gatewayBin, [], { detached: true, stdio: ["ignore", outFd, errFd], env: { ...process.env, ...gatewayEnv, }, }); // Track early-exit via the ChildProcess event rather than process.kill(pid, 0). // isPidAlive() returns true for zombies (detached children that exited but // were not reaped by a wait() call), which masks the failure mode reported // in `#3111`: the binary crashes on startup but the pid remains in the // process table as a zombie. The 'exit' event fires reliably for detached // children once libuv observes SIGCHLD. We still call unref() below so the // child doesn't hold the Node event loop open after we return. let childExited = false; let childExitCode: number | null = null; let childExitSignal: NodeJS.Signals | null = null; + let childStartError: string | null = null; + child.once("error", (error: NodeJS.ErrnoException) => { + childExited = true; + childStartError = error.message; + }); child.once("exit", (code: number | null, signal: NodeJS.Signals | null) => { childExited = true; childExitCode = code; childExitSignal = signal; }); child.unref(); const childPid = child.pid ?? 0; if (childPid <= 0) { + await new Promise((resolve) => setImmediate(resolve)); - throw new Error("OpenShell gateway process did not return a pid"); + throw new Error( + childStartError + ? `OpenShell gateway process failed to start: ${childStartError}` + : "OpenShell gateway process did not return a pid", + ); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 5466 - 5492, The code currently only listens for the child's "exit" event and then throws a generic error when child.pid is missing; register an "error" handler on the spawned process (the same way startModelRouter() does) to capture exec errors from spawn(gatewayBin, ...) into a variable (e.g., childExecError) and include that error's message/details when throwing the "OpenShell gateway process did not return a pid" error; update the PID-check logic around child.pid to reference the captured error and include it in the thrown Error so exec failures are preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/gateway-health-honest-integration.test.ts`:
- Around line 92-97: The test currently only asserts an upper bound on elapsed
time and can miss regressions where timeout is clamped to zero; update the test
around isDockerDriverGatewayTcpReady(port, 0) to assert that the elapsed time is
at least ~50ms (verifying the minimum timeout floor) and still less than the
existing 500ms upper bound by measuring Date.now() - started, i.e. ensure the
assertion checks both a lower bound (>=50ms) and the existing upper bound to
confirm the 50ms clamp is enforced.
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 5466-5492: The code currently only listens for the child's "exit"
event and then throws a generic error when child.pid is missing; register an
"error" handler on the spawned process (the same way startModelRouter() does) to
capture exec errors from spawn(gatewayBin, ...) into a variable (e.g.,
childExecError) and include that error's message/details when throwing the
"OpenShell gateway process did not return a pid" error; update the PID-check
logic around child.pid to reference the captured error and include it in the
thrown Error so exec failures are preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b19416e4-11dc-4e7d-995e-0a877c730853
📒 Files selected for processing (3)
ACCEPTANCE.mdsrc/lib/onboard.tstest/gateway-health-honest-integration.test.ts
✅ Files skipped from review due to trivial changes (1)
- ACCEPTANCE.md
Addresses two review findings on PR #3378: 1. cv (human): 'This shouldn't be in the PR' on ACCEPTANCE.md. Fair call — ACCEPTANCE.md is a pr-e2e-loop workflow artifact, not a code deliverable the repo should carry forward. The acceptance criterion and green-flip contract already live in the PR body and the commit messages on this branch. 2. CodeRabbit (Minor): the 'enforces a minimum timeout of 50ms' test only verified an upper bound, so a regression that removed the Math.max(50, timeoutMs) clamp and returned immediately would still pass. Rewrite the test to use a non-routable host (10.255.255.1) so an immediate ECONNREFUSED can't mask the regression, and assert elapsed >= 40 ms (10 ms slack under the 50 ms floor) AND < 2000 ms (generous ceiling for slow CI).
Addresses PR review feedback — onboard.ts is the God Object being decomposed, and growing its exports is an anti-pattern. Follow the pattern established by PR #3312 (which added src/lib/onboard/gateway-http-readiness.ts) and create a peer module for the Docker-driver path. New file: src/lib/onboard/gateway-tcp-readiness.ts Exports: isGatewayTcpReady(port, timeoutMs, host) Model: mirrors gateway-http-readiness.ts structure exactly — default port from GATEWAY_PORT, minimum-timeout clamp, pure async, never throws. New file: src/lib/onboard/gateway-tcp-readiness.test.ts Co-located behavioural tests per the established pattern (see temp-files.test.ts, legacy-command.test.ts, etc.) 6 tests: listening / closed / timeout / min-timeout / no-throw / default-port. Imports via ES module, not via dist/ indirection. onboard.ts net change: Before: +88 lines (+48 helper + integration) After: +22 lines (2 require lines + integration only) Related to: #3111
Selective E2E Results — ✅ All requested jobs passedRun: 25759369254
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/lib/onboard/gateway-tcp-readiness.ts (1)
96-100: ⚡ Quick winDrop unnecessary defensive
try/catcharoundsocket.destroy().This catch is likely redundant for this internal path and can mask unexpected behavior.
Based on learnings: "avoid adding defensive error handling ... around internal helper logic when there is no realistic throwing path ... remove unnecessary try/catch."Proposed simplification
- try { - socket.destroy(); - } catch { - /* best effort */ - } + socket.destroy();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/gateway-tcp-readiness.ts` around lines 96 - 100, Remove the unnecessary defensive try/catch around socket.destroy(): locate the socket.destroy() call in gateway-tcp-readiness (the block showing try { socket.destroy(); } catch { /* best effort */ }) and replace it with a plain socket.destroy() invocation so any unexpected errors aren't silently swallowed; ensure no other logic depends on the catch block and keep the surrounding function/flow unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5290-5296: Trim the long inline rationale block around the
ChildProcess handling to a single short comment referencing `#3111` and the key
behavior (use 'exit' event instead of isPidAlive() to avoid zombies) and remove
the extended explanation about libuv, SIGCHLD, and unref(); keep the existing
behavior and any calls to unref() and the 'exit' listener intact (search for
isPidAlive(), ChildProcess, unref(), and the 'exit' event in this area) and move
the detailed rationale into the extracted helper/tests or documentation so the
source stays under the entrypoint budget without changing runtime logic.
- Around line 5331-5347: The reuse paths that simply log "✓ Reusing existing
Docker-driver gateway" (and the branch that skips Step 2 during resume/onboard)
must also verify actual TCP readiness via isGatewayTcpReady() before
returning/succeeding; update the code in the reuse/skip branches to call await
isGatewayTcpReady(), only log the success and return when it resolves true, and
if it returns false continue to run the existing polling/start logic (or fall
through to the fresh-spawn loop) and emit a warning or retry instead of a
false-positive success. Ensure you modify the branches that currently
short-circuit (the "Reusing existing Docker-driver gateway" return) and the Step
2 skip branch so both reference isGatewayTcpReady() and behave consistently with
the fresh-spawn path.
---
Nitpick comments:
In `@src/lib/onboard/gateway-tcp-readiness.ts`:
- Around line 96-100: Remove the unnecessary defensive try/catch around
socket.destroy(): locate the socket.destroy() call in gateway-tcp-readiness (the
block showing try { socket.destroy(); } catch { /* best effort */ }) and replace
it with a plain socket.destroy() invocation so any unexpected errors aren't
silently swallowed; ensure no other logic depends on the catch block and keep
the surrounding function/flow unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7a6b0bc3-1c0c-4d4b-8791-34632f2f6db6
📒 Files selected for processing (4)
src/lib/onboard.tssrc/lib/onboard/gateway-tcp-readiness.test.tssrc/lib/onboard/gateway-tcp-readiness.tstest/gateway-health-honest-integration.test.ts
…odules
Addresses PR review: 'This PR is adding lines to onboard.ts — which is
now an antipattern' and the onboard-entrypoint-budget CI check.
src/lib/onboard.ts is the God Object being decomposed; growing it is
explicitly flagged as "away from target architecture" in the repo's
review guidelines, and the onboard-entrypoint-budget workflow fails the
PR if the entrypoint grows net-positive.
## Changes
Extract three focused modules under src/lib/onboard/ that collectively
replace the inline logic added in earlier commits on this branch:
- src/lib/onboard/gateway-tcp-readiness.ts
isGatewayTcpReady(port, timeoutMs, host): plain TCP liveness probe
for the Docker-driver gateway. Peer of gateway-http-readiness.ts
(introduced by #3312). Uses a minimum-timeout clamp so a regression
that drops Math.max(50, timeoutMs) is caught by tests.
- src/lib/onboard/child-exit-tracker.ts
trackChildExit(child): attaches a one-shot 'exit' listener to a
ChildProcess and returns a ChildExitState view. Zombie-safe: fires
reliably for detached children that isPidAlive(pid, 0) would
falsely report as alive (#3111). Also exposes describeExit() —
'exited with code 127' / 'killed by signal SIGKILL' — so callers
don't have to reformat the signal/code themselves.
- src/lib/onboard/docker-driver-gateway-failure.ts
reportDockerDriverGatewayStartFailure(logPath, childExit, opts):
the full diagnostic bundle emitted on startup failure — 'failed to
start' header, exit descriptor, redacted log tail, troubleshooting
footer. Honors exitOnFailure (production) vs. throw-only (recovery).
Each module has co-located behavioural tests (6 + 5 + 6 tests = 17),
using direct ES imports instead of the dist/ indirection that
test/onboard.test.ts uses. Source-shape guards in
test/gateway-health-honest-integration.test.ts enforce that the
three modules stay extracted (the entrypoint cannot reintroduce the
inline implementations).
## Result
src/lib/onboard.ts: 15 additions, 21 deletions — net -6 lines.
src/lib/onboard/**: 3 new modules, 17 new unit tests, full behavioural
+ signal-vs-code coverage on all three helpers.
Both E2E tests remain green on this branch:
- gateway-health-honest-e2e (#3111 coverage guard)
- openshell-gateway-upgrade-e2e (regression target the earlier
HTTP-probe attempt broke; TCP probe is the correct fit)
Related: #3111
Aligns with architecture goal tracked by onboard-entrypoint-budget CI.
TS7006 under strict mode — the implicit 'any' on errSpy.mock.calls.map's callback parameter. No behavior change.
Selective E2E Results — ✅ All requested jobs passedRun: 25760908563
|
Summary
nemoclaw onboardwas printing✓ Docker-driver gateway is healthyeven when the openshell-gateway binary crashed immediately on startup, then failing the next step withConnection refused. The reported trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in the shipped binary, but the underlying NemoClaw bug is platform-independent — any reason the binary fails to start (missing shared lib, CDI spec error, port conflict, permissions, corrupted download) surfaces the same false-positive.This PR fixes the false-positive at the caller site in
startDockerDriverGateway(), without modifying the sharedisGatewayHealthy()(which is pinned to pure-function status by the #2020 follow-up test).Fixes #3111.
Closes the gap covered by the failing E2E test added in #3362 —
gateway-health-honest-e2eflips from 🔴 red onmainto 🟢 green on this branch.Root cause
startDockerDriverGateway()insrc/lib/onboard.tsspawned the gateway binary withspawn(..., { detached: true })+child.unref(), then polled using two checks that could both lie:isPidAlive(childPid)usesprocess.kill(pid, 0)which returnstruefor zombies. Since the parent Node process neverwait()s on the detached child, crashed children linger as zombies andisPidAlivereports them as alive.isGatewayHealthy(status, gwInfo, activeGwInfo)is a pure string match over openshell CLI output.isGatewayConnectedinsrc/lib/state/gateway.tsmatches on"Server Status"— the table header thatopenshell statusalways prints. On a crashed gateway, the header is still emitted and the body contains× client error (Connect) tcp connect error Connection refused— butisGatewayConnectedreturns true anyway.Smoking gun from the red-on-main run 25698031380:
Changes
src/lib/onboard.ts— three coordinated changes instartDockerDriverGateway:Track child-exit via the ChildProcess
'exit'event, not justisPidAlive. Achild.once('exit', ...)listener flips achildExitedflag that the poll loop consults alongsideisPidAlive. This catches zombies thatisPidAlivemisses and also captures the exit code/signal for the failure message.Add
verifyDockerDriverGatewayListening(port, timeoutMs)— a TCP connect probe to127.0.0.1:${GATEWAY_PORT}usingnode:netwith a socket timeout. Resolves boolean, never throws. This is the Docker-driver path equivalent ofverifyGatewayContainerRunning(added for [All Platform][Onboard]Provider setup fails with "Connection refused" despite gateway showing healthy #2020 on the K3s path).Gate the "healthy" log on the TCP probe: the poll loop now only logs
✓ Docker-driver gateway is healthyafterisGatewayHealthyAND a successful TCP connect. On probe failure the loop keeps polling — the binary may still be binding its listener. ThechildExitedcheck at the top of the loop terminates us if the process actually died.Also improves the final failure message to include how the gateway exited (signal vs. exit code) so users don't have to
tailthe gateway log.What this PR does NOT change
isGatewayHealthyinsrc/lib/state/gateway.tsis left untouched. The #2020 follow-up test attest/gateway-liveness-probe.test.ts:74pins it to pure-function status ("no I/O, no docker, no spawn, no exec"). Fix at the caller, not the shared helper — same pattern as #2020.Tests
test/gateway-tcp-probe.test.ts(9 tests)verifyDockerDriverGatewayListeningresolves true for listening portssrc/lib/onboard.ts): child-exit tracking present,childExited || !isPidAlivecheck at poll-loop top, TCP probe called before the "healthy" log, exit details surfaced in the failure messagegateway-health-honest-e2e(from test(e2e): add gateway-health-honest coverage guard for #3111 #3362) — red on main, expected green on this branch. Will dispatch via nightly-e2e selective run once PR opens.vitestsuite passes (CLI);test/gateway-state.test.ts(47 tests) andtest/gateway-liveness-probe.test.ts(7 tests) still green — no behavior change in the shared helpers.Refactoring alignment
Noted in
ACCEPTANCE.md(also in this branch):TODO(#2562)on the TCP probe helper's timeout logic, for mechanical adoption later.TODO(#3213)on the failure-message format so it can migrate to structured advisories later.isGatewayHttpReadyfor K3s path) — same pattern, different surface. Easy to converge once both land; can extract a shared "gateway liveness probe" module if desired.gateway-bootstrap.tsextraction) — doesn't touchstartDockerDriverGateway; no conflict expected.Related
gateway-health-honest-e2e)Summary by CodeRabbit