Skip to content

test(e2e): add gateway-health-honest coverage guard for #3111#3362

Merged
jyaunches merged 4 commits into
mainfrom
test/gateway-health-honest-e2e-guard
May 11, 2026
Merged

test(e2e): add gateway-health-honest coverage guard for #3111#3362
jyaunches merged 4 commits into
mainfrom
test/gateway-health-honest-e2e-guard

Conversation

@jyaunches
Copy link
Copy Markdown
Contributor

@jyaunches jyaunches commented May 11, 2026

Coverage guard for #3111 — "Docker-driver gateway is healthy" false-positive

This PR adds an E2E regression test that fails on main today. It is
intentional that this test will be red on main and nightly will go
red
until the fix for #3111 lands.

The gap

Issue #3111 reports that on Ubuntu 22.04 the onboard flow prints:

Starting OpenShell Docker-driver gateway...
Gateway log: ~/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
✓ Docker-driver gateway is healthy    ← false positive

while the gateway log shows the binary never actually ran:

openshell-gateway: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found
openshell-gateway: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found

The underlying NemoClaw bug is platform-independent:

  • startGateway() (src/lib/onboard.ts:5500+) spawns the gateway
    binary with detached: true and child.unref(). When the binary
    crashes, the detached child becomes a zombie, and isPidAlive()
    (process.kill(pid, 0)) returns true for zombies — so the poll loop
    doesn't break.
  • registerDockerDriverGatewayEndpoint() (src/lib/onboard.ts:4347) is
    metadata-only: openshell gateway add --local --name nemoclaw <url>
    writes the endpoint to the config; it does NOT probe the endpoint.
  • isGatewayHealthy() (src/lib/state/gateway.ts:99) is a string
    match on openshell status and openshell gateway info output
    , not
    a live health check.

Result: on any Linux host where the gateway binary fails to start for
any reason (GLIBC mismatch, missing shared lib, permissions, OOM,
CDI-spec error, corrupted binary…), onboard reports
✓ Docker-driver gateway is healthy and proceeds to the next onboard
step, which then fails with a confusing Connection refused downstream.

There is an existing openshell-gateway-upgrade-e2e test covering the
stale-gateway-replaced path for PR #3001, but no test covers the
gateway-binary-crashes path that is the root issue in #3111.

What this test does

test/e2e/test-gateway-health-honest.sh:

  1. Installs the real openshell + openshell-gateway binaries via
    scripts/install-openshell.sh (same setup path as the existing
    upgrade test).
  2. Drops a sabotage shim at $STATE_DIR/openshell-gateway-sabotage that
    exits immediately with the same GLIBC-style stderr reported in [Linux][Install] PR #3001 Docker-driver gateway requires Ubuntu 24.04+ (GLIBC 2.39) — not documented, fails silently on Ubuntu 22.04 with false "healthy" status #3111.
  3. Invokes startGateway(null) via a Node heredoc, with
    NEMOCLAW_OPENSHELL_GATEWAY_BIN pointing at the shim.
  4. Asserts:
    • Primary: the onboard output does NOT contain
      "Docker-driver gateway is healthy".
    • Corroborating: the node process exits non-zero (≠ 0).
    • Corroborating: onboard surfaces a user-visible failure line
      (failed to start, crash, exit, not found, or a thrown
      exception).
    • Corroborating: no live non-zombie gateway process remains
      after the simulated crash.

The test runs on ubuntu-latest and does not require an Ubuntu 22.04
runner — it exercises the NemoClaw-side bug class, not the OpenShell-side
GLIBC packaging choice. The GLIBC compatibility concern is an OpenShell
team issue and is out of scope for this coverage guard.

Expected CI behavior

Ref Nightly job gateway-health-honest-e2e
this PR PR checks don't run nightly, so this PR's CI is green (modulo the usual unit/lint suite).
main (after merge) FAILS. Primary assertion trips: onboard logs ✓ Docker-driver gateway is healthy.
any branch that fixes #3111 PASSES.

The red-nightly tradeoff

Once merged, the nightly badge will go red on gateway-health-honest-e2e
until #3111 is fixed. That is the point — the failing test is the
executable acceptance criterion for the fix. A subsequent PR authored via
/skill:nemoclaw-issue-kickoff 3111 will produce the fix with this test
as its definition-of-done.

Expected failure output on main

[FAIL] Onboard reported '✓ Docker-driver gateway is healthy' although
the gateway binary crashed on startup (#3111 false-positive health check)
[DIAG] start log tail:
  Starting OpenShell Docker-driver gateway...
  Gateway log: /home/runner/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
  ✓ Docker-driver gateway is healthy
  __onboard_startGateway_returned_successfully__
[DIAG] gateway log tail:
  openshell-gateway-sabotage: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found
  openshell-gateway-sabotage: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found

Wiring

  • New file: test/e2e/test-gateway-health-honest.sh (~170 LOC, modeled
    after test/e2e/test-openshell-gateway-upgrade.sh)
  • New job: gateway-health-honest-e2e in .github/workflows/nightly-e2e.yaml
    (6 edits: comment, inputs.jobs description, new job block, 3 needs arrays)

References

Summary by CodeRabbit

  • Tests
    • Added an end-to-end test that verifies gateway start failure is detected, exposes user-visible error output, and ensures no lingering gateway processes remain.
  • Chores
    • Integrated the new test into the nightly pipeline as a selectable job, including artifact upload on failure and inclusion in failure reporting, PR comments, and the nightly scorecard.
  • Documentation
    • Updated nightly workflow job selection documentation and added a review-tool entry linking the test to the nightly job.

Review Change Stack

Review Change Stack

Adds a failing E2E test that demonstrates the #3111 false-positive
"Docker-driver gateway is healthy" bug.

Until the fix for #3111 lands, the nightly gateway-health-honest-e2e
job will fail. This is intentional — the failing test is the proof of
coverage and the executable acceptance criterion for #3111.

The bug reported on Ubuntu 22.04 (GLIBC 2.35 vs shipped binary
linked against GLIBC 2.38/2.39) is a specific instance of a
platform-independent NemoClaw bug: startGateway() spawns a detached
child, the crashed child remains a zombie, isPidAlive() returns true
for zombies, registerDockerDriverGatewayEndpoint() writes metadata
without probing, and isGatewayHealthy() is a string match on
openshell CLI output rather than a live health check. Result:
onboard logs "healthy" regardless of whether the gateway actually runs.

The test sabotages the gateway binary (via NEMOCLAW_OPENSHELL_GATEWAY_BIN)
with a shim that matches the #3111 failure mode, then asserts:
  - primary: log does NOT contain 'Docker-driver gateway is healthy'
  - corroborating: node process exits non-zero
  - corroborating: user-visible failure message surfaced
  - corroborating: no live non-zombie gateway process remains

Runs on ubuntu-latest — the test exercises the NemoClaw-side false-
positive, not the OpenShell-side GLIBC packaging.

Related: #3111
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ffc1ff5f-9c6e-465c-8fcb-0d5e6c71ca1d

📥 Commits

Reviewing files that changed from the base of the PR and between 4e651ad and 3f39599.

📒 Files selected for processing (1)
  • test/e2e/test-gateway-health-honest.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-gateway-health-honest.sh

📝 Walkthrough

Walkthrough

Adds a nightly workflow job and a new end-to-end test that simulates an openshell-gateway GLIBC startup crash, asserts the gateway is not reported "healthy", checks for failure indicators, and ensures no live non-zombie gateway process remains.

Changes

Coverage Guard for Issue #3111: False-Positive Gateway Health Status

Layer / File(s) Summary
Workflow Documentation
.github/workflows/nightly-e2e.yaml
Adds gateway-health-honest-e2e to the workflow dispatch job list documentation and input validation.
Workflow Job Definition
.github/workflows/nightly-e2e.yaml
Introduces gateway-health-honest-e2e job with Node.js 22 setup, test script execution, and failure artifact uploads.
Workflow Result Aggregation
.github/workflows/nightly-e2e.yaml
Wires the new job into notify-on-failure, report-to-pr, and scorecard so its status participates in failure notification, PR reporting, and nightly scorecard.
CodeRabbit Instruction
.coderabbit.yaml
Adds reviews.path_instructions entry for the new test and references the gateway-health-honest-e2e job for selective runs.
E2E Test Scaffolding and Setup
test/e2e/test-gateway-health-honest.sh
Provides test header with issue #3111 rationale, logging/failure helpers, environment initialization, cleanup traps, and bootstrap validation.
E2E Test Condition and Execution
test/e2e/test-gateway-health-honest.sh
Creates a sabotage gateway shim that emits GLIBC stderr and exits with code 127, then runs startGateway() under a Node harness with the sabotaged binary injected, capturing output and exit code.
E2E Test Assertions
test/e2e/test-gateway-health-honest.sh
Verifies sabotage evidence exists, the "✓ Docker-driver gateway is healthy" message is absent, the Node process exited non-zero and emits no success marker, a user-visible failure indicator appears, and any pid file does not correspond to a live non-zombie process; prints final pass message.

Sequence Diagram(s)

sequenceDiagram
  participant NodeHarness as Node test harness
  participant SabotageBin as Sabotage gateway shim
  participant Filesystem as Logs / PID file
  NodeHarness->>SabotageBin: spawn via NEMOCLAW_OPENSHELL_GATEWAY_BIN
  SabotageBin-->>NodeHarness: stderr GLIBC_2.38/2.39 not found + exit 127
  NodeHarness->>Filesystem: write start log and PID file
  NodeHarness->>Filesystem: capture exit code and log contents
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I’m a rabbit with a tiny test-bench hat,
I fake a GLIBC crash and watch logs chat,
"Healthy" lies vanish, the gateway won’t play,
Now nightly watches guard the break of day,
Hooray—#3111 sleeps easy at last. 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'test(e2e): add gateway-health-honest coverage guard for #3111' clearly and specifically describes the main change: adding a regression test for issue #3111.
Linked Issues check ✅ Passed The PR adds a comprehensive end-to-end test that directly validates the acceptance criterion from #3111: the test must fail until the issue is fixed, ensuring gateway startup failures are surfaced and not falsely reported as healthy.
Out of Scope Changes check ✅ Passed All changes are directly scoped to adding the regression test for #3111: new e2e test script, workflow job integration, and CodeRabbit configuration—no extraneous modifications present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/gateway-health-honest-e2e-guard

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25696214021
Branch: test/gateway-health-honest-e2e-guard
Requested jobs: gateway-health-honest-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
gateway-health-honest-e2e ❌ failure

Failed jobs: gateway-health-honest-e2e. Check run artifacts for logs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 1271-1304: Add a new path_instructions mapping in .coderabbit.yaml
for the gateway-health-honest-e2e job: create an entry keyed by
"gateway-health-honest-e2e" that lists "test/e2e/test-gateway-health-honest.sh"
as the covered source file and includes a brief instruction string describing
that this job runs the gateway health-honesty E2E test (e.g., "covers gateway
health-honesty E2E: test-gateway-health-honest.sh"); ensure the key matches the
job name gateway-health-honest-e2e and the path exactly matches
test/e2e/test-gateway-health-honest.sh so the test/validate-e2e-coverage.test.ts
cross-validation passes.

In `@test/e2e/test-gateway-health-honest.sh`:
- Line 37: The script uses "set -uo pipefail" but lacks -e, so add -e to make it
"set -euo pipefail" to fail fast on errors; then harden the assertions around
the existing acceptance of generic "not found" (the blocks currently at ~lines
110-116 and ~188-190) by replacing broad string matches with strict checks:
assert exact expected error messages or explicit exit codes from the command
under test (or use command -v/which to verify binaries exist before testing),
and remove any "|| true" or permissive greps that allow unrelated setup failures
to produce a false-green pass.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6db13377-1dc3-4ed1-bf2a-d1fefe9c9690

📥 Commits

Reviewing files that changed from the base of the PR and between f98ad00 and 1abe84c.

📒 Files selected for processing (2)
  • .github/workflows/nightly-e2e.yaml
  • test/e2e/test-gateway-health-honest.sh

Comment thread .github/workflows/nightly-e2e.yaml
Comment thread test/e2e/test-gateway-health-honest.sh Outdated
jyaunches added 2 commits May 11, 2026 16:51
Pure whitespace/redirect formatting applied by the repo's shfmt hook.
No behavioral change.
Addresses CodeRabbit review feedback on #3362:

1. Add `.coderabbit.yaml` path_instructions entry for the new
   test/e2e/test-gateway-health-honest.sh script so validate-e2e-coverage
   cross-validation passes and future reviewers get guidance on how to
   dispatch the job selectively.

2. Harden the test against false-green passes from unrelated setup
   errors:
     - switch `set -uo pipefail` → `set -euo pipefail` so that
       npm-ci / build:cli / install-openshell.sh failures fail the test
       script immediately instead of letting downstream assertions
       run against an empty log;
     - restructure the cleanup() pid-read so the `[ -f ... ] && ...`
       pattern doesn't fail the script under -e when the pid file is
       absent;
     - add a pre-assertion that positively proves the sabotage shim
       was invoked (GLIBC_2.38/2.39 or openshell-gateway-sabotage
       markers in the start log); without this, a stale dist/ or a
       broken build could satisfy the primary 'no healthy' assertion
       without exercising the gateway-failure code path;
     - narrow corroborating assertion 2 to exclude generic 'not found'
       which could be satisfied by an unrelated module-not-found.

No change to the primary assertion or the red-on-main behavior for #3111.
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25697935007
Branch: test/gateway-health-honest-e2e-guard
Requested jobs: gateway-health-honest-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
gateway-health-honest-e2e ❌ failure

Failed jobs: gateway-health-honest-e2e. Check run artifacts for logs.

The sabotage shim's stderr is written to the gateway log file opened by
onboard.ts:startGatewayWithOptions ($STATE_DIR/openshell-gateway.log),
not to the start log which only captures node's stdout/stderr.

The first nightly dispatch against the hardened version surfaced this:
the pre-assertion correctly caught that the markers were missing from
$START_LOG — but they were actually present in the right log file all
along. Fix: check the authoritative gateway log.

Also expand fail() diagnostics to dump the onboard gateway log tail.
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25698031380
Branch: test/gateway-health-honest-e2e-guard
Requested jobs: gateway-health-honest-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
gateway-health-honest-e2e ❌ failure

Failed jobs: gateway-health-honest-e2e. Check run artifacts for logs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/e2e/test-gateway-health-honest.sh (1)

98-107: ⚡ Quick win

Add marker file to cleanup if implemented.

If you implement the marker file approach, ensure it's cleaned up:

 cleanup() {
   set +e
   if [ -f "$PID_FILE" ]; then
     CHILD_PID="$(tr -d '[:space:]' <"$PID_FILE")"
   fi
   cleanup_pid "$CHILD_PID"
   openshell gateway remove nemoclaw >/dev/null 2>&1 || true
-  rm -f "$PID_FILE" "$SABOTAGE_BIN"
+  rm -f "$PID_FILE" "$SABOTAGE_BIN" "${STATE_DIR}/.sabotage-marker"
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-gateway-health-honest.sh` around lines 98 - 107, The cleanup
function currently removes PID_FILE and SABOTAGE_BIN but doesn't remove the
optional marker file if you implemented the marker approach; update the
cleanup() implementation (the cleanup function and the rm -f invocation
referencing "$PID_FILE" and "$SABOTAGE_BIN") to also remove the marker file
(e.g., "$MARKER_FILE" or whatever variable/name you used for the marker) and
ensure any conditional references (like checking if the marker exists) are
handled so the marker is always cleaned up on EXIT.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/test-gateway-health-honest.sh`:
- Around line 129-140: The sabotage shim currently only prints GLIBC markers to
stderr (written by the helper at SABOTAGE_BIN) which get lost because
startGateway() detaches the child; modify the shim to also create a simple
marker file (e.g., "${STATE_DIR}/.sabotage-marker") when executed so there's a
persistent proof the binary ran, then update the pre-assertion that checks
START_LOG to accept either the existing stderr grep OR the presence of that
MARKER_FILE and remove the marker after the check; reference SABOTAGE_BIN
(shim), startGateway(), START_LOG, STATE_DIR and MARKER_FILE when making these
changes.
- Around line 169-179: Update the pre-assertion in the block that inspects
START_LOG (the if ! grep -qE 'GLIBC_2\.3(8|9)|openshell-gateway-sabotage'
"$START_LOG"; then ...) to also accept the sabotage shim's marker file: check
for a marker path (e.g., SABOTAGE_MARKER or SABOTAGE_MARKER_FILE) and consider
the test exercised if that file exists; if neither the grep finds the
GLIBC/openshell marker nor the marker file exists then call fail with the
existing message, otherwise call pass (and optionally adjust the pass message to
mention the marker file when present). Ensure you reference START_LOG, the grep
check, the fail and pass calls when making the change.

---

Nitpick comments:
In `@test/e2e/test-gateway-health-honest.sh`:
- Around line 98-107: The cleanup function currently removes PID_FILE and
SABOTAGE_BIN but doesn't remove the optional marker file if you implemented the
marker approach; update the cleanup() implementation (the cleanup function and
the rm -f invocation referencing "$PID_FILE" and "$SABOTAGE_BIN") to also remove
the marker file (e.g., "$MARKER_FILE" or whatever variable/name you used for the
marker) and ensure any conditional references (like checking if the marker
exists) are handled so the marker is always cleaned up on EXIT.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1d726147-0282-46a2-989e-43a21bbc491b

📥 Commits

Reviewing files that changed from the base of the PR and between bb3f474 and 4e651ad.

📒 Files selected for processing (2)
  • .coderabbit.yaml
  • test/e2e/test-gateway-health-honest.sh
✅ Files skipped from review due to trivial changes (1)
  • .coderabbit.yaml

Comment thread test/e2e/test-gateway-health-honest.sh
Comment thread test/e2e/test-gateway-health-honest.sh Outdated
Comment on lines +169 to +179
info "node exit code: ${NODE_EXIT}"

# ── Pre-assertion: prove the sabotage path was actually exercised ───
# Without this guard, an unrelated setup failure (module-not-found,
# missing env, stale dist/, etc.) could produce a log that happens to
# lack the 'healthy' string and thereby false-green the primary
# assertion. We require positive evidence that the sabotage shim ran.
if ! grep -qE 'GLIBC_2\.3(8|9)|openshell-gateway-sabotage' "$START_LOG"; then
fail "Sabotage markers (GLIBC_2.38/2.39 or 'openshell-gateway-sabotage') not observed in start log — the test may have failed before the sabotaged gateway was invoked, so the assertions below cannot be trusted. Inspect the start log above for the real cause."
fi
pass "Sabotage shim was invoked as expected (GLIBC/sabotage markers present in log)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Pre-assertion logic is sound; needs marker file fallback.

The guard against false-green from unrelated setup failures is well-designed. Once the sabotage shim is updated to write a marker file (per the fix above), update this check:

-if ! grep -qE 'GLIBC_2\.3(8|9)|openshell-gateway-sabotage' "$START_LOG"; then
+MARKER_FILE="${STATE_DIR}/.sabotage-marker"
+if ! grep -qE 'GLIBC_2\.3(8|9)|openshell-gateway-sabotage' "$START_LOG" && [ ! -f "$MARKER_FILE" ]; then
   fail "Sabotage markers ... not observed in start log ..."
 fi
+rm -f "$MARKER_FILE"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-gateway-health-honest.sh` around lines 169 - 179, Update the
pre-assertion in the block that inspects START_LOG (the if ! grep -qE
'GLIBC_2\.3(8|9)|openshell-gateway-sabotage' "$START_LOG"; then ...) to also
accept the sabotage shim's marker file: check for a marker path (e.g.,
SABOTAGE_MARKER or SABOTAGE_MARKER_FILE) and consider the test exercised if that
file exists; if neither the grep finds the GLIBC/openshell marker nor the marker
file exists then call fail with the existing message, otherwise call pass (and
optionally adjust the pass message to mention the marker file when present).
Ensure you reference START_LOG, the grep check, the fail and pass calls when
making the change.

@wscurran wscurran added fix E2E End-to-end testing — Brev infrastructure, test cases, nightly failures, and coverage gaps labels May 11, 2026
@wscurran
Copy link
Copy Markdown
Contributor

@wscurran wscurran added the Docker Support for Docker containerization label May 11, 2026
@jyaunches jyaunches merged commit 32cab9d into main May 11, 2026
64 of 65 checks passed
jyaunches added a commit that referenced this pull request May 12, 2026
…x md lint

Address PR review findings:

1. Merge origin/main to pick up PR #3312 (isGatewayHttpReady from
   src/lib/onboard/gateway-http-readiness.ts). Replace our ad-hoc
   verifyDockerDriverGatewayListening (TCP probe) with a call to the
   shared isGatewayHttpReady helper — HTTP is strictly stronger than
   a raw TCP connect and is already the de-facto standard used by
   every other gateway-reuse decision site in onboard.ts after #3312.
   Docker-driver and K3s paths now converge on the same probe.

2. Drop the parallel test/gateway-tcp-probe.test.ts (9 tests for a
   helper that no longer exists). The shared helper's behavior is
   already covered by test/gateway-http-reuse-wait.test.ts (21 tests).
   Replace with test/gateway-health-honest-integration.test.ts (6
   source-shape tests) guarding the #3111 integration pattern.

3. Fix 7 markdownlint errors in ACCEPTANCE.md (MD040, MD022, MD031) —
   the 'checks' job flagged these; now passes locally.

4. Update ACCEPTANCE.md to reflect the simpler design: Phase 1 is
   'reuse existing helper' rather than 'add new TCP probe helper',
   and the refactoring alignment table records #3312 as 'adopted'
   rather than 'to be coordinated'.

Behavior of the fix is unchanged:
  - child.once('exit') listener tracks zombies (detached children
    that process.kill(pid, 0) would falsely report as alive)
  - poll loop guard is childExited || !isPidAlive(childPid)
  - healthy log gated on isGatewayHttpReady() after isGatewayHealthy()
  - user-visible exit code/signal surfaced on failure

Related: #3111
Coverage guard: #3362 — gateway-health-honest-e2e
cv added a commit that referenced this pull request May 12, 2026
…eporting healthy (#3111) (#3378)

## Summary

`nemoclaw onboard` was printing `✓ Docker-driver gateway is healthy`
even when the openshell-gateway binary crashed immediately on startup,
then failing the next step with `Connection refused`. The reported
trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in the shipped
binary, but the underlying NemoClaw bug is **platform-independent** —
any reason the binary fails to start (missing shared lib, CDI spec
error, port conflict, permissions, corrupted download) surfaces the same
false-positive.

This PR fixes the false-positive at the caller site in
`startDockerDriverGateway()`, without modifying the shared
`isGatewayHealthy()` (which is pinned to pure-function status by the
#2020 follow-up test).

Fixes #3111.
Closes the gap covered by the failing E2E test added in #3362 —
`gateway-health-honest-e2e` flips from 🔴 red on `main` to 🟢 green on
this branch.

## Root cause

`startDockerDriverGateway()` in `src/lib/onboard.ts` spawned the gateway
binary with `spawn(..., { detached: true })` + `child.unref()`, then
polled using two checks that could both lie:

1. **`isPidAlive(childPid)`** uses `process.kill(pid, 0)` which returns
`true` for **zombies**. Since the parent Node process never `wait()`s on
the detached child, crashed children linger as zombies and `isPidAlive`
reports them as alive.

2. **`isGatewayHealthy(status, gwInfo, activeGwInfo)`** is a pure string
match over openshell CLI output. `isGatewayConnected` in
`src/lib/state/gateway.ts` matches on `"Server Status"` — the **table
header** that `openshell status` always prints. On a crashed gateway,
the header is still emitted and the body contains `× client error
(Connect) tcp connect error Connection refused` — but
`isGatewayConnected` returns true anyway.

Smoking gun from the red-on-main run
[25698031380](https://github.com/NVIDIA/NemoClaw/actions/runs/25698031380):

```
[DIAG] openshell status: Server Status
  Gateway: nemoclaw
  Server: http://127.0.0.1:8080
Error:   × client error (Connect)
  ├─▶ tcp connect error
  ╰─▶ Connection refused (os error 111)
```

## Changes

**`src/lib/onboard.ts`** — three coordinated changes in
`startDockerDriverGateway`:

1. **Track child-exit via the ChildProcess `'exit'` event**, not just
`isPidAlive`. A `child.once('exit', ...)` listener flips a `childExited`
flag that the poll loop consults alongside `isPidAlive`. This catches
zombies that `isPidAlive` misses and also captures the exit code/signal
for the failure message.

2. **Add `verifyDockerDriverGatewayListening(port, timeoutMs)`** — a TCP
connect probe to `127.0.0.1:${GATEWAY_PORT}` using `node:net` with a
socket timeout. Resolves boolean, never throws. This is the
Docker-driver path equivalent of `verifyGatewayContainerRunning` (added
for #2020 on the K3s path).

3. **Gate the "healthy" log on the TCP probe**: the poll loop now only
logs `✓ Docker-driver gateway is healthy` after `isGatewayHealthy`
**AND** a successful TCP connect. On probe failure the loop keeps
polling — the binary may still be binding its listener. The
`childExited` check at the top of the loop terminates us if the process
actually died.

Also improves the final failure message to include **how** the gateway
exited (signal vs. exit code) so users don't have to `tail` the gateway
log.

## What this PR does NOT change

`isGatewayHealthy` in `src/lib/state/gateway.ts` is left untouched. The
#2020 follow-up test at `test/gateway-liveness-probe.test.ts:74` pins it
to pure-function status ("no I/O, no docker, no spawn, no exec"). Fix at
the caller, not the shared helper — same pattern as #2020.

## Tests

- **New unit test:** `test/gateway-tcp-probe.test.ts` (9 tests)
- `verifyDockerDriverGatewayListening` resolves true for listening ports
  - resolves false for closed ports (Connection refused)
  - resolves false on timeout (non-routable RFC 1918 host)
  - enforces minimum 50 ms timeout
  - never throws
- **source-shape guards** (regexes over `src/lib/onboard.ts`):
child-exit tracking present, `childExited || !isPidAlive` check at
poll-loop top, TCP probe called before the "healthy" log, exit details
surfaced in the failure message
- **E2E acceptance gate:** `gateway-health-honest-e2e` (from #3362) —
red on main, expected green on this branch. Will dispatch via
nightly-e2e selective run once PR opens.
- **Existing tests:** full `vitest` suite passes (CLI);
`test/gateway-state.test.ts` (47 tests) and
`test/gateway-liveness-probe.test.ts` (7 tests) still green — no
behavior change in the shared helpers.

## Refactoring alignment

Noted in `ACCEPTANCE.md` (also in this branch):

- **#2562** (unified timeout abstraction) — `TODO(#2562)` on the TCP
probe helper's timeout logic, for mechanical adoption later.
- **#3213** (unified advisory registry) — `TODO(#3213)` on the
failure-message format so it can migrate to structured advisories later.
- **PR #3312** (laitingsheng — `isGatewayHttpReady` for K3s path) — same
pattern, different surface. Easy to converge once both land; can extract
a shared "gateway liveness probe" module if desired.
- **PR #3306** (cv — `gateway-bootstrap.ts` extraction) — doesn't touch
`startDockerDriverGateway`; no conflict expected.

## Related

- Fixes #3111
- Coverage guard: #3362 (`gateway-health-honest-e2e`)
- Cross-reference patterns: #2020 (K3s path equivalent), #3312 (HTTP
readiness for K3s reuse)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a TCP readiness check during gateway startup to ensure the
service is actually listening before reporting healthy.
* **Bug Fixes**
* Reduced false “alive” reports for the gateway by detecting early
child-process termination.
* Improved startup failure messages to include how the gateway process
terminated (signal vs exit code).
* **Tests**
* Added integration and unit tests to validate startup health and TCP
probe behavior.

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3378)
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
ericksoa added a commit that referenced this pull request May 13, 2026
## Summary

Moves the existing #3111 `gateway-health-honest-e2e` coverage guard out
of scheduled `nightly-e2e.yaml` and into the new `regression-e2e.yaml`
holding-pen workflow.

## Why

Failing-test-first guards and high-signal regression anchors should be
easy to dispatch while fixes are in flight, but should not automatically
keep scheduled nightly red. The new regression workflow gives us a place
to keep these guards and periodically review/promote stable ones into
nightly.

## What changed

- Removed `gateway-health-honest-e2e` from `nightly-e2e.yaml` job list
and reporting dependencies.
- Added `regression-e2e.yaml` with `gateway-health-honest-e2e` as a
manually dispatchable regression job.
- Left `test/e2e/test-gateway-health-honest.sh` unchanged.

## Validation

- YAML parse for `nightly-e2e.yaml`
- YAML parse for `regression-e2e.yaml`

Related: #3111
Related PR: #3362


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Removed the automated nightly E2E run for gateway-health-honest and
updated nightly job selection and downstream notifications accordingly
* Added a manual regression E2E workflow for gateway-health-honest with
on-demand job selection, gating logic, concurrency control, and
failure-artifact upload
* Updated local testing guidance to recommend using the new regression
workflow for gateway-health-honest runs

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3411)
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Docker Support for Docker containerization E2E End-to-end testing — Brev infrastructure, test cases, nightly failures, and coverage gaps fix

Projects

None yet

3 participants