Skip to content

fix(onboard): verify Docker-driver gateway is really serving before reporting healthy (#3111)#3378

Merged
cv merged 12 commits into
mainfrom
issue-3111-gateway-health-honesty
May 12, 2026
Merged

fix(onboard): verify Docker-driver gateway is really serving before reporting healthy (#3111)#3378
cv merged 12 commits into
mainfrom
issue-3111-gateway-health-honesty

Conversation

@jyaunches
Copy link
Copy Markdown
Contributor

@jyaunches jyaunches commented May 12, 2026

Summary

nemoclaw onboard was printing ✓ Docker-driver gateway is healthy even when the openshell-gateway binary crashed immediately on startup, then failing the next step with Connection refused. The reported trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in the shipped binary, but the underlying NemoClaw bug is platform-independent — any reason the binary fails to start (missing shared lib, CDI spec error, port conflict, permissions, corrupted download) surfaces the same false-positive.

This PR fixes the false-positive at the caller site in startDockerDriverGateway(), without modifying the shared isGatewayHealthy() (which is pinned to pure-function status by the #2020 follow-up test).

Fixes #3111.
Closes the gap covered by the failing E2E test added in #3362gateway-health-honest-e2e flips from 🔴 red on main to 🟢 green on this branch.

Root cause

startDockerDriverGateway() in src/lib/onboard.ts spawned the gateway binary with spawn(..., { detached: true }) + child.unref(), then polled using two checks that could both lie:

  1. isPidAlive(childPid) uses process.kill(pid, 0) which returns true for zombies. Since the parent Node process never wait()s on the detached child, crashed children linger as zombies and isPidAlive reports them as alive.

  2. isGatewayHealthy(status, gwInfo, activeGwInfo) is a pure string match over openshell CLI output. isGatewayConnected in src/lib/state/gateway.ts matches on "Server Status" — the table header that openshell status always prints. On a crashed gateway, the header is still emitted and the body contains × client error (Connect) tcp connect error Connection refused — but isGatewayConnected returns true anyway.

Smoking gun from the red-on-main run 25698031380:

[DIAG] openshell status: Server Status
  Gateway: nemoclaw
  Server: http://127.0.0.1:8080
Error:   × client error (Connect)
  ├─▶ tcp connect error
  ╰─▶ Connection refused (os error 111)

Changes

src/lib/onboard.ts — three coordinated changes in startDockerDriverGateway:

  1. Track child-exit via the ChildProcess 'exit' event, not just isPidAlive. A child.once('exit', ...) listener flips a childExited flag that the poll loop consults alongside isPidAlive. This catches zombies that isPidAlive misses and also captures the exit code/signal for the failure message.

  2. Add verifyDockerDriverGatewayListening(port, timeoutMs) — a TCP connect probe to 127.0.0.1:${GATEWAY_PORT} using node:net with a socket timeout. Resolves boolean, never throws. This is the Docker-driver path equivalent of verifyGatewayContainerRunning (added for [All Platform][Onboard]Provider setup fails with "Connection refused" despite gateway showing healthy #2020 on the K3s path).

  3. Gate the "healthy" log on the TCP probe: the poll loop now only logs ✓ Docker-driver gateway is healthy after isGatewayHealthy AND a successful TCP connect. On probe failure the loop keeps polling — the binary may still be binding its listener. The childExited check at the top of the loop terminates us if the process actually died.

Also improves the final failure message to include how the gateway exited (signal vs. exit code) so users don't have to tail the gateway log.

What this PR does NOT change

isGatewayHealthy in src/lib/state/gateway.ts is left untouched. The #2020 follow-up test at test/gateway-liveness-probe.test.ts:74 pins it to pure-function status ("no I/O, no docker, no spawn, no exec"). Fix at the caller, not the shared helper — same pattern as #2020.

Tests

  • New unit test: test/gateway-tcp-probe.test.ts (9 tests)
    • verifyDockerDriverGatewayListening resolves true for listening ports
    • resolves false for closed ports (Connection refused)
    • resolves false on timeout (non-routable RFC 1918 host)
    • enforces minimum 50 ms timeout
    • never throws
    • source-shape guards (regexes over src/lib/onboard.ts): child-exit tracking present, childExited || !isPidAlive check at poll-loop top, TCP probe called before the "healthy" log, exit details surfaced in the failure message
  • E2E acceptance gate: gateway-health-honest-e2e (from test(e2e): add gateway-health-honest coverage guard for #3111 #3362) — red on main, expected green on this branch. Will dispatch via nightly-e2e selective run once PR opens.
  • Existing tests: full vitest suite passes (CLI); test/gateway-state.test.ts (47 tests) and test/gateway-liveness-probe.test.ts (7 tests) still green — no behavior change in the shared helpers.

Refactoring alignment

Noted in ACCEPTANCE.md (also in this branch):

Related

Summary by CodeRabbit

  • New Features
    • Added a TCP readiness check during gateway startup to ensure the service is actually listening before reporting healthy.
  • Bug Fixes
    • Reduced false “alive” reports for the gateway by detecting early child-process termination.
    • Improved startup failure messages to include how the gateway process terminated (signal vs exit code).
  • Tests
    • Added integration and unit tests to validate startup health and TCP probe behavior.

Review Change Stack

jyaunches added 2 commits May 11, 2026 18:43
Auto-generated by pr-e2e-loop Phase 4.5. See PR #3362 for the
coverage guard (failing E2E that is the acceptance gate for this fix).
…eporting healthy

Fixes #3111.

## Problem

startDockerDriverGateway() spawned the openshell-gateway binary with
`detached: true` + child.unref(), then polled using two checks that
could both lie:

1. **isPidAlive(childPid)** uses `process.kill(pid, 0)` which returns
   true for zombies. Since the parent Node process never wait()s on
   the detached child, crashed children linger as zombies indefinitely
   and isPidAlive reports them as alive.

2. **isGatewayHealthy(status, gwInfo, activeGwInfo)** is a pure string
   match over openshell CLI output. isGatewayConnected() matches on
   "Server Status" — the table *header* that `openshell status`
   always prints, regardless of whether the gateway endpoint is
   reachable. On a crashed gateway, the header is still emitted and
   the body contains `Connection refused` — but isGatewayConnected
   returns true anyway because it only looks for the header string.

Result: onboard logged `✓ Docker-driver gateway is healthy`
immediately after spawning a crashed binary, then proceeded to the
next onboard step which failed with `Connection refused (os error 111)`.

The reported trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in
the shipped openshell-gateway binary, but the class of bug is
platform-independent: any reason the binary fails to start (missing
shared lib, permissions, CDI spec error, port conflict) surfaces the
same false-positive.

## Fix

Three coordinated changes in `startDockerDriverGateway`:

1. **Track child-exit via the ChildProcess 'exit' event**, not just
   isPidAlive. The exit listener flips a `childExited` flag that the
   poll loop consults alongside isPidAlive. This catches zombies that
   isPidAlive misses.

2. **Add verifyDockerDriverGatewayListening(port, timeoutMs)** — a
   TCP connect probe to 127.0.0.1:${GATEWAY_PORT} using node:net with
   an AbortSignal-equivalent timeout. Resolves boolean, never throws.
   This is the DockerDriver path equivalent of verifyGatewayContainerRunning
   (added for #2020 on the K3s path).

3. **Gate the 'healthy' log on the TCP probe**: the poll loop now only
   logs "✓ Docker-driver gateway is healthy" after isGatewayHealthy
   AND a successful TCP connect. On probe failure, keep polling — the
   binary may be slow to bind its listener. The childExited check at
   the top of the loop terminates us if the process actually died.

Also improves the final failure message to include how the gateway
exited (signal vs. exit code) so users don't have to dig into the
gateway log.

## isGatewayHealthy intentionally not modified

The #2020 follow-up test in test/gateway-liveness-probe.test.ts pins
isGatewayHealthy to pure-function status (no I/O). Fix at the caller,
not the shared helper — same pattern as #2020 used for the K3s path.

## Tests

- New unit test: test/gateway-tcp-probe.test.ts (9 tests)
  - verifyDockerDriverGatewayListening resolves true for listening ports
  - resolves false for closed ports (Connection refused)
  - resolves false on timeout (non-routable host)
  - enforces minimum 50ms timeout
  - never throws
  - source-shape guards: child-exit tracking, childExited check in poll
    loop, TCP probe before "healthy" log, exit details in failure msg
- E2E acceptance gate: gateway-health-honest-e2e (from PR #3362) must
  flip from red on main to green on this branch.

## Refactoring alignment

- TODO(#2562) added for the timeout abstraction
- TODO(#3213) added for structured advisory output
- Coordinates with open PRs #3306 (gateway-bootstrap refactor) and
  #3312 (isGatewayHttpReady for K3s path) — same pattern, different
  surface. See ACCEPTANCE.md.

Related: #3111
Coverage guard: #3362
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 12, 2026

E2E Advisor Recommendation

Required E2E: gateway-health-honest-e2e, openshell-gateway-upgrade-e2e
Optional E2E: overlayfs-autofix-e2e, issue-2478-crash-loop-recovery-e2e, onboard-repair-e2e, onboard-resume-e2e, sandbox-survival-e2e, cloud-onboard-e2e

Dispatch hint: gateway-health-honest-e2e,openshell-gateway-upgrade-e2e,overlayfs-autofix-e2e,issue-2478-crash-loop-recovery-e2e,onboard-repair-e2e,onboard-resume-e2e,sandbox-survival-e2e,cloud-onboard-e2e

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gateway-health-honest-e2e: This is the exact coverage guard for [Linux][Install] PR #3001 Docker-driver gateway requires Ubuntu 24.04+ (GLIBC 2.39) — not documented, fails silently on Ubuntu 22.04 with false "healthy" status #3111 — the issue this PR fixes. The test installs a sabotage gateway binary that crashes at startup and asserts onboard must NOT log 'Docker-driver gateway is healthy' and must exit non-zero. Validates both halves of the fix (TCP probe + child-exit tracker) end-to-end under a real install.
  • openshell-gateway-upgrade-e2e: Explicit regression target called out in test/gateway-health-honest-integration.test.ts: a prior attempt at this fix used isGatewayHttpReady (K3s path) on the Docker-driver gateway and broke this upgrade test because the Docker-driver gateway returns 404 on GET /. Required to prove the TCP-based probe does not regress stale-gateway restart-after-upgrade behavior.

Optional E2E

  • overlayfs-autofix-e2e: Exercises the same startDockerDriverGateway polling loop that now includes the TCP readiness gate; confirms the auto-fix env plumbing still reaches a healthy gateway under the new ready-gate.
  • issue-2478-crash-loop-recovery-e2e: Soak test for the gateway recovery preload chain. The recovery path calls startGatewayForRecovery → startGatewayWithOptions with exitOnFailure:false, which is the branch that now routes through reportDockerDriverGatewayStartFailure without exiting. Worth running to confirm the non-fatal reporter branch behaves correctly under a flapping gateway.
  • onboard-repair-e2e: Exercises onboard's repair path which re-enters the gateway startup flow; confirms the new TCP-gated health log interacts correctly with repair semantics.
  • onboard-resume-e2e: Resuming an interrupted onboard triggers gateway re-validation. Useful smoke for the reuse-existing-gateway path where isGatewayHealthy runs before any TCP probe.
  • sandbox-survival-e2e: Validates sandbox survival across gateway stop/start. The gateway restart path now depends on the new TCP readiness gate; confidence check that restart is still declared healthy only when the gateway is actually listening.
  • cloud-onboard-e2e: Full public-installer onboard flow; a broad confidence check that end-to-end onboarding (which now includes the TCP ready-gate) still converges on a healthy gateway under normal conditions.

New E2E recommendations

  • onboard/docker-driver-gateway-lifecycle (medium): The existing gateway-health-honest-e2e covers the 'binary crashes immediately' failure mode. It does not cover the inverse false-negative: a Docker-driver gateway that is metadata-healthy AND has a listening TCP socket but on a slow first-accept (e.g. long image pull inside the child) should not be prematurely declared unhealthy. An E2E that injects a delayed-listen gateway shim would guard the new TCP-probe clamp/backoff against regressing into a flaky ready-gate.
    • Suggested test: test/e2e/test-gateway-tcp-ready-slow-start.sh — substitute openshell-gateway with a shim that sleeps N seconds before bind(), assert onboard waits through the poll budget and eventually logs 'Docker-driver gateway is healthy' without false-negative exit.
  • onboard/child-process-supervision (low): trackChildExit is the only place where detached-child exit is observed. A future refactor that accidentally removes child.once('exit', ...) (or swaps in an AbortSignal-based API) would silently reintroduce [Linux][Install] PR #3001 Docker-driver gateway requires Ubuntu 24.04+ (GLIBC 2.39) — not documented, fails silently on Ubuntu 22.04 with false "healthy" status #3111. A scripted E2E that kills the gateway child with SIGKILL mid-poll and asserts onboard surfaces 'killed by signal SIGKILL before becoming ready' would lock in the describeExit() user-visible contract end-to-end.
    • Suggested test: test/e2e/test-gateway-child-signal-reporting.sh — start onboard with a gateway shim that runs long enough to be killed, externally kill -9 the child, assert onboard prints 'Gateway process killed by signal SIGKILL before becoming ready' and exits non-zero.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: gateway-health-honest-e2e,openshell-gateway-upgrade-e2e,overlayfs-autofix-e2e,issue-2478-crash-loop-recovery-e2e,onboard-repair-e2e,onboard-resume-e2e,sandbox-survival-e2e,cloud-onboard-e2e

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25706219740
Branch: issue-3111-gateway-health-honesty
Requested jobs: gateway-health-honest-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
gateway-health-honest-e2e ✅ success

Comment thread src/lib/onboard.ts Fixed
@cv cv closed this May 12, 2026
@cv cv reopened this May 12, 2026
jyaunches added 2 commits May 12, 2026 10:58
…x md lint

Address PR review findings:

1. Merge origin/main to pick up PR #3312 (isGatewayHttpReady from
   src/lib/onboard/gateway-http-readiness.ts). Replace our ad-hoc
   verifyDockerDriverGatewayListening (TCP probe) with a call to the
   shared isGatewayHttpReady helper — HTTP is strictly stronger than
   a raw TCP connect and is already the de-facto standard used by
   every other gateway-reuse decision site in onboard.ts after #3312.
   Docker-driver and K3s paths now converge on the same probe.

2. Drop the parallel test/gateway-tcp-probe.test.ts (9 tests for a
   helper that no longer exists). The shared helper's behavior is
   already covered by test/gateway-http-reuse-wait.test.ts (21 tests).
   Replace with test/gateway-health-honest-integration.test.ts (6
   source-shape tests) guarding the #3111 integration pattern.

3. Fix 7 markdownlint errors in ACCEPTANCE.md (MD040, MD022, MD031) —
   the 'checks' job flagged these; now passes locally.

4. Update ACCEPTANCE.md to reflect the simpler design: Phase 1 is
   'reuse existing helper' rather than 'add new TCP probe helper',
   and the refactoring alignment table records #3312 as 'adopted'
   rather than 'to be coordinated'.

Behavior of the fix is unchanged:
  - child.once('exit') listener tracks zombies (detached children
    that process.kill(pid, 0) would falsely report as alive)
  - poll loop guard is childExited || !isPidAlive(childPid)
  - healthy log gated on isGatewayHttpReady() after isGatewayHealthy()
  - user-visible exit code/signal surfaced on failure

Related: #3111
Coverage guard: #3362 — gateway-health-honest-e2e
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Records gateway child exit state and requires a successful TCP connect probe before marking the Docker-driver gateway healthy; updates startup loop gating and failure diagnostics; adds a TCP readiness helper with unit tests and a source-shape integration test that asserts the new wiring and guards.

Changes

Gateway Health Check Reliability

Layer / File(s) Summary
TCP readiness helper and tests
src/lib/onboard/gateway-tcp-readiness.ts, src/lib/onboard/gateway-tcp-readiness.test.ts
Adds exported isGatewayTcpReady(port?: number, timeoutMs?: number, host?: string): Promise<boolean> that attempts a localhost TCP connect (default 127.0.0.1:GATEWAY_PORT), clamps a minimum timeout, never rejects, and returns boolean. Adds Vitest tests for listening, refused ports, non-routable timeout, minimum-timeout clamping, never-throws semantics, and default-port behavior.
onboard: child-exit tracking & TCP gating
src/lib/onboard.ts
Imports isGatewayTcpReady; registers a one-time child "exit" listener capturing childExitCode/childExitSignal and childExited; updates the startup poll loop to break when childExited or PID-liveness fails; requires both isGatewayHealthy(...) and await isGatewayTcpReady(...) before logging healthy; enhances final failure messages with child-exit diagnostics when observed.
Source-shape integration assertions
test/gateway-health-honest-integration.test.ts
Adds an integration test that extracts startDockerDriverGateway source and asserts exit-listener and childExited presence, poll-loop gating on child-exit/PID-liveness, awaited isGatewayTcpReady(...) before the healthy log, absence of isGatewayHttpReady(...) in the Docker-driver loop, isGatewayTcpReady required from its module, and child-exit details in failure output.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A tiny rabbit heard the socket tap,
And tracked the child that vanished in the gap.
Now probes knock once, and exit flags declare,
No ghostly green when nothing's really there.
🐰 Honest health, with ears to spare.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding TCP readiness verification before reporting the Docker-driver gateway as healthy, which directly addresses the false-positive health reporting issue.
Linked Issues check ✅ Passed The code changes comprehensively address #3111 requirements: detects actual child process exit via 'exit' listener, performs TCP connectivity probe via isGatewayTcpReady, requires both health checks before reporting healthy, and includes exit diagnostics.
Out of Scope Changes check ✅ Passed All changes directly support fixing #3111: onboard.ts modifications implement core fix logic, gateway-tcp-readiness files add the TCP probe, and the test file validates both implementation and source-shape requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch issue-3111-gateway-health-honesty

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25742970106
Branch: issue-3111-gateway-health-honesty
Requested jobs: gateway-health-honest-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
gateway-health-honest-e2e ✅ success

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)

5420-5498: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Replace blocking sleep() with an async delay to allow the event loop to process the child exit event.

The child.once("exit", ...) handler only fires when the event loop processes the SIGCHLD signal. In the polling loop, when the gateway crashes before isGatewayHealthy() reaches the awaited HTTP probe, both sleep() calls (lines 5452 and 5479) block the entire event loop. This prevents childExited from being set by the handler, keeping it false for the entire poll window. The loop then relies on isPidAlive() alone, which treats zombies as alive, defeating the early-exit logic and preventing the new exit-code and exit-signal diagnostics from populating.

Replace each sleep(pollInterval) with an awaitable async delay to unblock the event loop between polling iterations:

Suggested fix
   const pollCount = envInt("NEMOCLAW_HEALTH_POLL_COUNT", 30);
   const pollInterval = envInt("NEMOCLAW_HEALTH_POLL_INTERVAL", 2);
+  const waitForNextPoll = (seconds: number) =>
+    new Promise<void>((resolve) => setTimeout(resolve, seconds * 1000));
   for (let i = 0; i < pollCount; i += 1) {
     // `#3111`: combine the exit-event check with isPidAlive. The exit event
     // catches zombies that isPidAlive misses; isPidAlive catches children
     // we somehow lost track of (defense in depth).
     if (childExited || !isPidAlive(childPid)) {
       break;
     }
     if (!registerDockerDriverGatewayEndpoint()) {
-      if (i < pollCount - 1) sleep(pollInterval);
+      if (i < pollCount - 1) await waitForNextPoll(pollInterval);
       continue;
     }
     const status = runCaptureOpenshell(["status"], { ignoreError: true });
     const namedInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], {
       ignoreError: true,
     });
     const currentInfo = runCaptureOpenshell(["gateway", "info"], { ignoreError: true });
     if (isGatewayHealthy(status, namedInfo, currentInfo)) {
       if (await isGatewayHttpReady()) {
         console.log("  ✓ Docker-driver gateway is healthy");
         return;
       }
     }
-    if (i < pollCount - 1) sleep(pollInterval);
+    if (i < pollCount - 1) await waitForNextPoll(pollInterval);
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 5420 - 5498, The polling loop uses a
blocking sleep(pollInterval) which prevents the child.once("exit", ...) handler
from running; replace both blocking calls to sleep(pollInterval) inside the for
loop with an awaitable non-blocking delay (e.g. an async delay that wraps
setTimeout) and await it (await delay(pollInterval * <appropriate unit>)) so the
event loop can process SIGCHLD and set childExited/childExitCode; ensure the
enclosing function is async (or already is) so you can await the delay, and
update the two instances where sleep(pollInterval) is called in the loop (the
branch after registerDockerDriverGatewayEndpoint() and the end-of-loop backoff)
to use await delay(...) instead.
🧹 Nitpick comments (2)
test/gateway-health-honest-integration.test.ts (2)

91-93: ⚡ Quick win

Remove the arbitrary 200-char window in the shared-import assertion.

{0,200} can fail on harmless formatting/reordering. Matching the destructuring+require shape directly is more stable and still enforces the same contract.

Proposed regex stabilization
-    expect(content).toMatch(
-      /isGatewayHttpReady,[\s\S]{0,200}?require\(\s*["']\.\/onboard\/gateway-http-readiness["']/,
-    );
+    expect(content).toMatch(
+      /\{[\s\S]*\bisGatewayHttpReady\b[\s\S]*\}\s*=\s*require\(\s*["']\.\/onboard\/gateway-http-readiness["']\s*\)/,
+    );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/gateway-health-honest-integration.test.ts` around lines 91 - 93, The
test's regex uses an arbitrary {0,200} window which can break on harmless
formatting changes; update the assertion in the test that inspects content (the
expect(content).toMatch call) to directly match the destructuring and require
shape for isGatewayHttpReady and ./onboard/gateway-http-readiness without the
{0,200} range — i.e., remove the {0,200} and use a stable pattern that allows
optional whitespace/newlines between "isGatewayHttpReady," and
require("./onboard/gateway-http-readiness") so the assertion targets the exact
destructuring+require pairing.

58-62: ⚡ Quick win

Tighten the healthy-log gating assertion to reduce false positives.

The current check only verifies that some await isGatewayHttpReady( appears before the log string, not that they are coupled in the same decision path.

Proposed assertion hardening
-    const healthyIdx = fnBody.indexOf("Docker-driver gateway is healthy");
-    expect(healthyIdx).toBeGreaterThan(0);
-    const before = fnBody.slice(0, healthyIdx);
-    expect(before).toMatch(/await\s+isGatewayHttpReady\(/);
+    expect(fnBody).toMatch(
+      /if\s*\([\s\S]*await\s+isGatewayHttpReady\([^)]*\)[\s\S]*\)\s*\{[\s\S]{0,300}Docker-driver gateway is healthy/,
+    );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/gateway-health-honest-integration.test.ts` around lines 58 - 62, The
current test only verifies that some "await isGatewayHttpReady(" appears before
the "Docker-driver gateway is healthy" log, which can yield false positives;
change the assertion to find the last occurrence of "await isGatewayHttpReady("
before healthyIdx (use fnBody.lastIndexOf("await isGatewayHttpReady(",
healthyIdx)) and assert that the substring from that index to healthyIdx
includes a decision construct tying them together (e.g., matches
/if\s*\(.*await\s+isGatewayHttpReady\(|if\s*\(\s*await\s+isGatewayHttpReady/ or
that the call is immediately used in an if/ternary), ensuring the awaited call
and the "Docker-driver gateway is healthy" log are in the same control-flow
branch rather than merely earlier in the file; update references to healthyIdx,
fnBody, and isGatewayHttpReady accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ACCEPTANCE.md`:
- Line 12: The ACCEPTANCE.md contains emoji status markers (e.g., 🔴, 🟢, 🟡, ✨)
in technical prose such as the nightly job name gateway-health-honest-e2e and
several other sections; remove all emoji and replace them with plain text status
labels (e.g., "red"/"green"/"yellow"/"beta" or more descriptive words like
"failing"/"passing") consistently across the referenced areas (the job line and
the other occurrences called out in the comment), updating any inline
descriptions so they read as standard text without emojis and preserving the
original meaning and formatting.
- Line 1: Add the required SPDX license header as the very first line of
ACCEPTANCE.md (before the existing "# Issue `#3111` — Development Plan +
Acceptance Criterion" title); insert a single-line SPDX-License-Identifier with
the project's chosen identifier (for example MIT or Apache-2.0) so the file
complies with the `**/*.{js,ts,tsx,sh,md}` header rule and passes linting.

---

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 5420-5498: The polling loop uses a blocking sleep(pollInterval)
which prevents the child.once("exit", ...) handler from running; replace both
blocking calls to sleep(pollInterval) inside the for loop with an awaitable
non-blocking delay (e.g. an async delay that wraps setTimeout) and await it
(await delay(pollInterval * <appropriate unit>)) so the event loop can process
SIGCHLD and set childExited/childExitCode; ensure the enclosing function is
async (or already is) so you can await the delay, and update the two instances
where sleep(pollInterval) is called in the loop (the branch after
registerDockerDriverGatewayEndpoint() and the end-of-loop backoff) to use await
delay(...) instead.

---

Nitpick comments:
In `@test/gateway-health-honest-integration.test.ts`:
- Around line 91-93: The test's regex uses an arbitrary {0,200} window which can
break on harmless formatting changes; update the assertion in the test that
inspects content (the expect(content).toMatch call) to directly match the
destructuring and require shape for isGatewayHttpReady and
./onboard/gateway-http-readiness without the {0,200} range — i.e., remove the
{0,200} and use a stable pattern that allows optional whitespace/newlines
between "isGatewayHttpReady," and require("./onboard/gateway-http-readiness") so
the assertion targets the exact destructuring+require pairing.
- Around line 58-62: The current test only verifies that some "await
isGatewayHttpReady(" appears before the "Docker-driver gateway is healthy" log,
which can yield false positives; change the assertion to find the last
occurrence of "await isGatewayHttpReady(" before healthyIdx (use
fnBody.lastIndexOf("await isGatewayHttpReady(", healthyIdx)) and assert that the
substring from that index to healthyIdx includes a decision construct tying them
together (e.g., matches
/if\s*\(.*await\s+isGatewayHttpReady\(|if\s*\(\s*await\s+isGatewayHttpReady/ or
that the call is immediately used in an if/ternary), ensuring the awaited call
and the "Docker-driver gateway is healthy" log are in the same control-flow
branch rather than merely earlier in the file; update references to healthyIdx,
fnBody, and isGatewayHttpReady accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 70d3075d-2157-4107-96aa-f203a903950a

📥 Commits

Reviewing files that changed from the base of the PR and between edb7478 and 29bba3e.

📒 Files selected for processing (3)
  • ACCEPTANCE.md
  • src/lib/onboard.ts
  • test/gateway-health-honest-integration.test.ts

Comment thread ACCEPTANCE.md Outdated
Comment thread ACCEPTANCE.md Outdated
- Add SPDX license header (HTML-comment style, matching tools/e2e-advisor/README.md)
- Replace emoji status markers (🔴/🟢/🟡/✨) with plain-text labels
  ([E2E], [OK], [Unit]) per the coding guideline 'No emoji in technical prose'.
  Literal '✓' characters in log-output examples are retained because they
  are not decorative — they are the actual bytes the onboard code prints.
Comment thread ACCEPTANCE.md Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be in the PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — removed in e6c08a7. ACCEPTANCE.md was a pr-e2e-loop workflow artifact (the skill writes it as a handoff from the coverage-guard PR to the fix PR). The relevant acceptance content — "gateway-health-honest-e2e must flip from red on main to green on this branch" — is already captured in the PR description and the fix commit message, so the file wasn't adding anything durable. I'll update the skill to either not emit this file into the repo or write it under a gitignored path.

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25747662365
Branch: issue-3111-gateway-health-honesty
Requested jobs: gateway-health-honest-e2e,openshell-gateway-upgrade-e2e
Summary: 1 passed, 1 failed, 0 skipped

Job Result
gateway-health-honest-e2e ✅ success
openshell-gateway-upgrade-e2e ❌ failure

Failed jobs: openshell-gateway-upgrade-e2e. Check run artifacts for logs.

@jyaunches
Copy link
Copy Markdown
Contributor Author

⚠️ Targeted E2E validation requested by the E2E Advisor is not green yet.

Run: https://github.com/NVIDIA/NemoClaw/actions/runs/25747662365

Results:

  • gateway-health-honest-e2e: passed
  • openshell-gateway-upgrade-e2e: failed

Please investigate/fix the failing E2E path and rerun the targeted validation. This PR should not proceed until the advisor-required E2E set is passing.

To have an agent rerun this after fixes, use a prompt like:

For NemoClaw PR #3378, update the branch from main if conflict-free, then dispatch .github/workflows/nightly-e2e.yaml on branch issue-3111-gateway-health-honesty with jobs=gateway-health-honest-e2e,openshell-gateway-upgrade-e2e. Monitor the run, report pass/fail, and if it fails, inspect artifacts/logs and help fix the failure before rerunning.

Note: I attempted to submit this as a formal “request changes” review, but GitHub does not allow requesting changes on your own PR from this account, so posting as a blocking comment instead.

@wscurran wscurran added Docker Support for Docker containerization fix labels May 12, 2026
…iness

Addresses the failing openshell-gateway-upgrade-e2e surfaced by the
e2e-advisor on PR #3378.

## Problem

The previous commit converged the Docker-driver gateway's readiness
probe onto isGatewayHttpReady from ./onboard/gateway-http-readiness
(introduced by #3312). That broke openshell-gateway-upgrade-e2e:

    Starting OpenShell gateway...
    [PASS] Stale gateway is healthy with supervisor:0.0.36
    [INFO] Invoking NemoClaw gateway start path; it must restart the stale process
    Docker-driver gateway failed to start.    <-- 60s timeout

The gateway log showed the process was actually serving correctly:
    POST /openshell.v1.OpenShell/Health response status=200 latency_ms=0
    POST /openshell.v1.OpenShell/Health response status=200 latency_ms=0
    GET  /                                  response status=404 latency_ms=0

## Root cause

isGatewayHttpReady does GET /, and only accepts {200, 401}. The K3s-path
gateway answers GET / via a dispatcher catch-all. The Docker-driver
gateway, by contrast, only serves routes under /openshell.v1.OpenShell/*
and returns 404 for GET /. So the probe always fails on the Docker-
driver path regardless of whether the gateway is up.

## Fix

Revert to a plain TCP probe on the gateway port for the Docker-driver
path. Add a small isDockerDriverGatewayTcpReady(port, timeoutMs, host)
helper next to isPidAlive. This is semantic-free — it just asks "is
anyone listening" — which is exactly what's needed to detect the #3111
failure mode (crashed binary → nothing listening) without making
assumptions about HTTP route shape.

isGatewayHttpReady is left in place for all K3s-path call sites; the
two paths use different probes because the two gateway types expose
different HTTP surfaces. TODO(#3213) filed for future convergence.

## Tests

test/gateway-health-honest-integration.test.ts now has 10 tests:
  5 behavioural: listening / closed / timeout / min-timeout / no-throw
  5 source-shape: child-exit tracking, poll-loop guard, TCP probe call
    site, regression guard against reintroducing the HTTP probe in the
    Docker-driver loop, exit-details surfaced

## Expected CI result

  - gateway-health-honest-e2e: GREEN (already was — this was never affected)
  - openshell-gateway-upgrade-e2e: FLIPS BACK to GREEN (the breakage this
    commit addresses)

Related: #3111
Broke in: 29bba3e (refactor(onboard): converge on shared isGatewayHttpReady)
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25752904986
Branch: issue-3111-gateway-health-honesty
Requested jobs: gateway-health-honest-e2e,openshell-gateway-upgrade-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
gateway-health-honest-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)

5466-5492: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle spawn() failures before the PID check to preserve exec error details.

If openshell-gateway cannot exec, Node reports this through the child "error" event, but the current code only listens for "exit". When exec fails, pid remains unset and we throw a generic "did not return a pid" error, losing the real failure reason. The startModelRouter() function in this file already handles both "error" and "exit" events and includes the error detail in the PID-check message. Apply the same pattern here.

Suggested fix
   const child = spawn(gatewayBin, [], {
     detached: true,
     stdio: ["ignore", outFd, errFd],
     env: {
       ...process.env,
       ...gatewayEnv,
     },
   });
   // Track early-exit via the ChildProcess event rather than process.kill(pid, 0).
   // isPidAlive() returns true for zombies (detached children that exited but
   // were not reaped by a wait() call), which masks the failure mode reported
   // in `#3111`: the binary crashes on startup but the pid remains in the
   // process table as a zombie. The 'exit' event fires reliably for detached
   // children once libuv observes SIGCHLD. We still call unref() below so the
   // child doesn't hold the Node event loop open after we return.
   let childExited = false;
   let childExitCode: number | null = null;
   let childExitSignal: NodeJS.Signals | null = null;
+  let childStartError: string | null = null;
+  child.once("error", (error: NodeJS.ErrnoException) => {
+    childExited = true;
+    childStartError = error.message;
+  });
   child.once("exit", (code: number | null, signal: NodeJS.Signals | null) => {
     childExited = true;
     childExitCode = code;
     childExitSignal = signal;
   });
   child.unref();
   const childPid = child.pid ?? 0;
   if (childPid <= 0) {
+    await new Promise((resolve) => setImmediate(resolve));
-    throw new Error("OpenShell gateway process did not return a pid");
+    throw new Error(
+      childStartError
+        ? `OpenShell gateway process failed to start: ${childStartError}`
+        : "OpenShell gateway process did not return a pid",
+    );
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 5466 - 5492, The code currently only listens
for the child's "exit" event and then throws a generic error when child.pid is
missing; register an "error" handler on the spawned process (the same way
startModelRouter() does) to capture exec errors from spawn(gatewayBin, ...) into
a variable (e.g., childExecError) and include that error's message/details when
throwing the "OpenShell gateway process did not return a pid" error; update the
PID-check logic around child.pid to reference the captured error and include it
in the thrown Error so exec failures are preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/gateway-health-honest-integration.test.ts`:
- Around line 92-97: The test currently only asserts an upper bound on elapsed
time and can miss regressions where timeout is clamped to zero; update the test
around isDockerDriverGatewayTcpReady(port, 0) to assert that the elapsed time is
at least ~50ms (verifying the minimum timeout floor) and still less than the
existing 500ms upper bound by measuring Date.now() - started, i.e. ensure the
assertion checks both a lower bound (>=50ms) and the existing upper bound to
confirm the 50ms clamp is enforced.

---

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 5466-5492: The code currently only listens for the child's "exit"
event and then throws a generic error when child.pid is missing; register an
"error" handler on the spawned process (the same way startModelRouter() does) to
capture exec errors from spawn(gatewayBin, ...) into a variable (e.g.,
childExecError) and include that error's message/details when throwing the
"OpenShell gateway process did not return a pid" error; update the PID-check
logic around child.pid to reference the captured error and include it in the
thrown Error so exec failures are preserved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b19416e4-11dc-4e7d-995e-0a877c730853

📥 Commits

Reviewing files that changed from the base of the PR and between a5fad4e and c93a5fe.

📒 Files selected for processing (3)
  • ACCEPTANCE.md
  • src/lib/onboard.ts
  • test/gateway-health-honest-integration.test.ts
✅ Files skipped from review due to trivial changes (1)
  • ACCEPTANCE.md

Comment thread test/gateway-health-honest-integration.test.ts Outdated
jyaunches added 3 commits May 12, 2026 14:29
Addresses two review findings on PR #3378:

1. cv (human): 'This shouldn't be in the PR' on ACCEPTANCE.md.
   Fair call — ACCEPTANCE.md is a pr-e2e-loop workflow artifact, not
   a code deliverable the repo should carry forward. The acceptance
   criterion and green-flip contract already live in the PR body and
   the commit messages on this branch.

2. CodeRabbit (Minor): the 'enforces a minimum timeout of 50ms' test
   only verified an upper bound, so a regression that removed the
   Math.max(50, timeoutMs) clamp and returned immediately would still
   pass. Rewrite the test to use a non-routable host (10.255.255.1)
   so an immediate ECONNREFUSED can't mask the regression, and assert
   elapsed >= 40 ms (10 ms slack under the 50 ms floor) AND < 2000 ms
   (generous ceiling for slow CI).
Addresses PR review feedback — onboard.ts is the God Object being
decomposed, and growing its exports is an anti-pattern. Follow the
pattern established by PR #3312 (which added
src/lib/onboard/gateway-http-readiness.ts) and create a peer module
for the Docker-driver path.

New file: src/lib/onboard/gateway-tcp-readiness.ts
  Exports: isGatewayTcpReady(port, timeoutMs, host)
  Model: mirrors gateway-http-readiness.ts structure exactly —
         default port from GATEWAY_PORT, minimum-timeout clamp,
         pure async, never throws.

New file: src/lib/onboard/gateway-tcp-readiness.test.ts
  Co-located behavioural tests per the established pattern
  (see temp-files.test.ts, legacy-command.test.ts, etc.)
  6 tests: listening / closed / timeout / min-timeout / no-throw /
  default-port. Imports via ES module, not via dist/ indirection.

onboard.ts net change:
  Before: +88 lines (+48 helper + integration)
  After:  +22 lines (2 require lines + integration only)

Related to: #3111
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25759369254
Branch: issue-3111-gateway-health-honesty
Requested jobs: gateway-health-honest-e2e,openshell-gateway-upgrade-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
gateway-health-honest-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/lib/onboard/gateway-tcp-readiness.ts (1)

96-100: ⚡ Quick win

Drop unnecessary defensive try/catch around socket.destroy().

This catch is likely redundant for this internal path and can mask unexpected behavior.

Proposed simplification
-      try {
-        socket.destroy();
-      } catch {
-        /* best effort */
-      }
+      socket.destroy();
Based on learnings: "avoid adding defensive error handling ... around internal helper logic when there is no realistic throwing path ... remove unnecessary try/catch."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/gateway-tcp-readiness.ts` around lines 96 - 100, Remove the
unnecessary defensive try/catch around socket.destroy(): locate the
socket.destroy() call in gateway-tcp-readiness (the block showing try {
socket.destroy(); } catch { /* best effort */ }) and replace it with a plain
socket.destroy() invocation so any unexpected errors aren't silently swallowed;
ensure no other logic depends on the catch block and keep the surrounding
function/flow unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5290-5296: Trim the long inline rationale block around the
ChildProcess handling to a single short comment referencing `#3111` and the key
behavior (use 'exit' event instead of isPidAlive() to avoid zombies) and remove
the extended explanation about libuv, SIGCHLD, and unref(); keep the existing
behavior and any calls to unref() and the 'exit' listener intact (search for
isPidAlive(), ChildProcess, unref(), and the 'exit' event in this area) and move
the detailed rationale into the extracted helper/tests or documentation so the
source stays under the entrypoint budget without changing runtime logic.
- Around line 5331-5347: The reuse paths that simply log "✓ Reusing existing
Docker-driver gateway" (and the branch that skips Step 2 during resume/onboard)
must also verify actual TCP readiness via isGatewayTcpReady() before
returning/succeeding; update the code in the reuse/skip branches to call await
isGatewayTcpReady(), only log the success and return when it resolves true, and
if it returns false continue to run the existing polling/start logic (or fall
through to the fresh-spawn loop) and emit a warning or retry instead of a
false-positive success. Ensure you modify the branches that currently
short-circuit (the "Reusing existing Docker-driver gateway" return) and the Step
2 skip branch so both reference isGatewayTcpReady() and behave consistently with
the fresh-spawn path.

---

Nitpick comments:
In `@src/lib/onboard/gateway-tcp-readiness.ts`:
- Around line 96-100: Remove the unnecessary defensive try/catch around
socket.destroy(): locate the socket.destroy() call in gateway-tcp-readiness (the
block showing try { socket.destroy(); } catch { /* best effort */ }) and replace
it with a plain socket.destroy() invocation so any unexpected errors aren't
silently swallowed; ensure no other logic depends on the catch block and keep
the surrounding function/flow unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7a6b0bc3-1c0c-4d4b-8791-34632f2f6db6

📥 Commits

Reviewing files that changed from the base of the PR and between e6c08a7 and 7917b0c.

📒 Files selected for processing (4)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-tcp-readiness.test.ts
  • src/lib/onboard/gateway-tcp-readiness.ts
  • test/gateway-health-honest-integration.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts Outdated
…odules

Addresses PR review: 'This PR is adding lines to onboard.ts — which is
now an antipattern' and the onboard-entrypoint-budget CI check.

src/lib/onboard.ts is the God Object being decomposed; growing it is
explicitly flagged as "away from target architecture" in the repo's
review guidelines, and the onboard-entrypoint-budget workflow fails the
PR if the entrypoint grows net-positive.

## Changes

Extract three focused modules under src/lib/onboard/ that collectively
replace the inline logic added in earlier commits on this branch:

- src/lib/onboard/gateway-tcp-readiness.ts
    isGatewayTcpReady(port, timeoutMs, host): plain TCP liveness probe
    for the Docker-driver gateway. Peer of gateway-http-readiness.ts
    (introduced by #3312). Uses a minimum-timeout clamp so a regression
    that drops Math.max(50, timeoutMs) is caught by tests.

- src/lib/onboard/child-exit-tracker.ts
    trackChildExit(child): attaches a one-shot 'exit' listener to a
    ChildProcess and returns a ChildExitState view. Zombie-safe: fires
    reliably for detached children that isPidAlive(pid, 0) would
    falsely report as alive (#3111). Also exposes describeExit() —
    'exited with code 127' / 'killed by signal SIGKILL' — so callers
    don't have to reformat the signal/code themselves.

- src/lib/onboard/docker-driver-gateway-failure.ts
    reportDockerDriverGatewayStartFailure(logPath, childExit, opts):
    the full diagnostic bundle emitted on startup failure — 'failed to
    start' header, exit descriptor, redacted log tail, troubleshooting
    footer. Honors exitOnFailure (production) vs. throw-only (recovery).

Each module has co-located behavioural tests (6 + 5 + 6 tests = 17),
using direct ES imports instead of the dist/ indirection that
test/onboard.test.ts uses. Source-shape guards in
test/gateway-health-honest-integration.test.ts enforce that the
three modules stay extracted (the entrypoint cannot reintroduce the
inline implementations).

## Result

  src/lib/onboard.ts: 15 additions, 21 deletions — net -6 lines.
  src/lib/onboard/**: 3 new modules, 17 new unit tests, full behavioural
      + signal-vs-code coverage on all three helpers.

Both E2E tests remain green on this branch:
  - gateway-health-honest-e2e (#3111 coverage guard)
  - openshell-gateway-upgrade-e2e (regression target the earlier
    HTTP-probe attempt broke; TCP probe is the correct fit)

Related: #3111
Aligns with architecture goal tracked by onboard-entrypoint-budget CI.
TS7006 under strict mode — the implicit 'any' on errSpy.mock.calls.map's
callback parameter. No behavior change.
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25760908563
Branch: issue-3111-gateway-health-honesty
Requested jobs: gateway-health-honest-e2e,openshell-gateway-upgrade-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
gateway-health-honest-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success

@cv cv merged commit b2ecb84 into main May 12, 2026
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Docker Support for Docker containerization fix

Projects

None yet

4 participants