test(e2e): add gateway-health-honest coverage guard for #3111 (#3362)
Adds a failing E2E test that demonstrates the #3111 false-positive "Docker-driver gateway is healthy" bug. Until the fix for #3111 lands, the nightly gateway-health-honest-e2e job will fail. This is intentional: the failing test is the proof of coverage and the executable acceptance criterion for #3111.

The bug reported on Ubuntu 22.04 (GLIBC 2.35 vs shipped binary linked against GLIBC 2.38/2.39) is a specific instance of a platform-independent NemoClaw bug: startGateway() spawns a detached child, the crashed child remains a zombie, isPidAlive() returns true for zombies, registerDockerDriverGatewayEndpoint() writes metadata without probing, and isGatewayHealthy() is a string match on openshell CLI output rather than a live health check. Result: onboard logs "healthy" regardless of whether the gateway actually runs.

The test sabotages the gateway binary (via NEMOCLAW_OPENSHELL_GATEWAY_BIN) with a shim that matches the #3111 failure mode, then asserts:
- primary: log does NOT contain 'Docker-driver gateway is healthy'
- corroborating: node process exits non-zero
- corroborating: user-visible failure message surfaced
- corroborating: no live non-zombie gateway process remains

Runs on ubuntu-latest: the test exercises the NemoClaw-side false-positive, not the OpenShell-side GLIBC packaging.

Related: #3111
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Adds a nightly workflow job and a new end-to-end test that simulates an openshell-gateway GLIBC startup crash, asserts the gateway is not reported "healthy", checks for failure indicators, and ensures no live non-zombie gateway process remains.
Changes
Coverage Guard for Issue
Sequence Diagram(s)
sequenceDiagram
participant NodeHarness as Node test harness
participant SabotageBin as Sabotage gateway shim
participant Filesystem as Logs / PID file
NodeHarness->>SabotageBin: spawn via NEMOCLAW_OPENSHELL_GATEWAY_BIN
SabotageBin-->>NodeHarness: stderr GLIBC_2.38/2.39 not found + exit 127
NodeHarness->>Filesystem: write start log and PID file
NodeHarness->>Filesystem: capture exit code and log contents
Selective E2E Results: ❌ Some jobs failed. Run: 25696214021
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 1271-1304: Add a new path_instructions mapping in .coderabbit.yaml
for the gateway-health-honest-e2e job: create an entry keyed by
"gateway-health-honest-e2e" that lists "test/e2e/test-gateway-health-honest.sh"
as the covered source file and includes a brief instruction string describing
that this job runs the gateway health-honesty E2E test (e.g., "covers gateway
health-honesty E2E: test-gateway-health-honest.sh"); ensure the key matches the
job name gateway-health-honest-e2e and the path exactly matches
test/e2e/test-gateway-health-honest.sh so the test/validate-e2e-coverage.test.ts
cross-validation passes.
In `@test/e2e/test-gateway-health-honest.sh`:
- Line 37: The script uses "set -uo pipefail" but lacks -e, so add -e to make it
"set -euo pipefail" to fail fast on errors; then harden the assertions around
the existing acceptance of generic "not found" (the blocks currently at ~lines
110-116 and ~188-190) by replacing broad string matches with strict checks:
assert exact expected error messages or explicit exit codes from the command
under test (or use command -v/which to verify binaries exist before testing),
and remove any "|| true" or permissive greps that allow unrelated setup failures
to produce a false-green pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6db13377-1dc3-4ed1-bf2a-d1fefe9c9690
📒 Files selected for processing (2)
- .github/workflows/nightly-e2e.yaml
- test/e2e/test-gateway-health-honest.sh
Pure whitespace/redirect formatting applied by the repo's shfmt hook. No behavioral change.
Addresses CodeRabbit review feedback on #3362:

1. Add `.coderabbit.yaml` path_instructions entry for the new test/e2e/test-gateway-health-honest.sh script so validate-e2e-coverage cross-validation passes and future reviewers get guidance on how to dispatch the job selectively.
2. Harden the test against false-green passes from unrelated setup errors:
   - switch `set -uo pipefail` → `set -euo pipefail` so that npm-ci / build:cli / install-openshell.sh failures fail the test script immediately instead of letting downstream assertions run against an empty log;
   - restructure the cleanup() pid-read so the `[ -f ... ] && ...` pattern doesn't fail the script under -e when the pid file is absent;
   - add a pre-assertion that positively proves the sabotage shim was invoked (GLIBC_2.38/2.39 or openshell-gateway-sabotage markers in the start log); without this, a stale dist/ or a broken build could satisfy the primary 'no healthy' assertion without exercising the gateway-failure code path;
   - narrow corroborating assertion 2 to exclude generic 'not found', which could be satisfied by an unrelated module-not-found.

No change to the primary assertion or the red-on-main behavior for #3111.
Selective E2E Results: ❌ Some jobs failed. Run: 25697935007
The sabotage shim's stderr is written to the gateway log file opened by onboard.ts:startGatewayWithOptions ($STATE_DIR/openshell-gateway.log), not to the start log which only captures node's stdout/stderr. The first nightly dispatch against the hardened version surfaced this: the pre-assertion correctly caught that the markers were missing from $START_LOG — but they were actually present in the right log file all along. Fix: check the authoritative gateway log. Also expand fail() diagnostics to dump the onboard gateway log tail.
Selective E2E Results: ❌ Some jobs failed. Run: 25698031380
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/e2e/test-gateway-health-honest.sh (1)
98-107: ⚡ Quick win: Add marker file to cleanup if implemented.
If you implement the marker file approach, ensure it's cleaned up:
 cleanup() {
   set +e
   if [ -f "$PID_FILE" ]; then
     CHILD_PID="$(tr -d '[:space:]' <"$PID_FILE")"
   fi
   cleanup_pid "$CHILD_PID"
   openshell gateway remove nemoclaw >/dev/null 2>&1 || true
-  rm -f "$PID_FILE" "$SABOTAGE_BIN"
+  rm -f "$PID_FILE" "$SABOTAGE_BIN" "${STATE_DIR}/.sabotage-marker"
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-gateway-health-honest.sh` around lines 98 - 107, The cleanup function currently removes PID_FILE and SABOTAGE_BIN but doesn't remove the optional marker file if you implemented the marker approach; update the cleanup() implementation (the cleanup function and the rm -f invocation referencing "$PID_FILE" and "$SABOTAGE_BIN") to also remove the marker file (e.g., "$MARKER_FILE" or whatever variable/name you used for the marker) and ensure any conditional references (like checking if the marker exists) are handled so the marker is always cleaned up on EXIT.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/test-gateway-health-honest.sh`:
- Around line 129-140: The sabotage shim currently only prints GLIBC markers to
stderr (written by the helper at SABOTAGE_BIN) which get lost because
startGateway() detaches the child; modify the shim to also create a simple
marker file (e.g., "${STATE_DIR}/.sabotage-marker") when executed so there's a
persistent proof the binary ran, then update the pre-assertion that checks
START_LOG to accept either the existing stderr grep OR the presence of that
MARKER_FILE and remove the marker after the check; reference SABOTAGE_BIN
(shim), startGateway(), START_LOG, STATE_DIR and MARKER_FILE when making these
changes.
- Around line 169-179: Update the pre-assertion in the block that inspects
START_LOG (the if ! grep -qE 'GLIBC_2\.3(8|9)|openshell-gateway-sabotage'
"$START_LOG"; then ...) to also accept the sabotage shim's marker file: check
for a marker path (e.g., SABOTAGE_MARKER or SABOTAGE_MARKER_FILE) and consider
the test exercised if that file exists; if neither the grep finds the
GLIBC/openshell marker nor the marker file exists then call fail with the
existing message, otherwise call pass (and optionally adjust the pass message to
mention the marker file when present). Ensure you reference START_LOG, the grep
check, the fail and pass calls when making the change.
---
Nitpick comments:
In `@test/e2e/test-gateway-health-honest.sh`:
- Around line 98-107: The cleanup function currently removes PID_FILE and
SABOTAGE_BIN but doesn't remove the optional marker file if you implemented the
marker approach; update the cleanup() implementation (the cleanup function and
the rm -f invocation referencing "$PID_FILE" and "$SABOTAGE_BIN") to also remove
the marker file (e.g., "$MARKER_FILE" or whatever variable/name you used for the
marker) and ensure any conditional references (like checking if the marker
exists) are handled so the marker is always cleaned up on EXIT.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1d726147-0282-46a2-989e-43a21bbc491b
📒 Files selected for processing (2)
- .coderabbit.yaml
- test/e2e/test-gateway-health-honest.sh
✅ Files skipped from review due to trivial changes (1)
- .coderabbit.yaml
info "node exit code: ${NODE_EXIT}"

# ── Pre-assertion: prove the sabotage path was actually exercised ───
# Without this guard, an unrelated setup failure (module-not-found,
# missing env, stale dist/, etc.) could produce a log that happens to
# lack the 'healthy' string and thereby false-green the primary
# assertion. We require positive evidence that the sabotage shim ran.
if ! grep -qE 'GLIBC_2\.3(8|9)|openshell-gateway-sabotage' "$START_LOG"; then
  fail "Sabotage markers (GLIBC_2.38/2.39 or 'openshell-gateway-sabotage') not observed in start log — the test may have failed before the sabotaged gateway was invoked, so the assertions below cannot be trusted. Inspect the start log above for the real cause."
fi
pass "Sabotage shim was invoked as expected (GLIBC/sabotage markers present in log)"
Pre-assertion logic is sound; needs marker file fallback.
The guard against false-green from unrelated setup failures is well-designed. Once the sabotage shim is updated to write a marker file (per the fix above), update this check:
-if ! grep -qE 'GLIBC_2\.3(8|9)|openshell-gateway-sabotage' "$START_LOG"; then
+MARKER_FILE="${STATE_DIR}/.sabotage-marker"
+if ! grep -qE 'GLIBC_2\.3(8|9)|openshell-gateway-sabotage' "$START_LOG" && [ ! -f "$MARKER_FILE" ]; then
fail "Sabotage markers ... not observed in start log ..."
fi
+rm -f "$MARKER_FILE"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/test-gateway-health-honest.sh` around lines 169 - 179, Update the
pre-assertion in the block that inspects START_LOG (the if ! grep -qE
'GLIBC_2\.3(8|9)|openshell-gateway-sabotage' "$START_LOG"; then ...) to also
accept the sabotage shim's marker file: check for a marker path (e.g.,
SABOTAGE_MARKER or SABOTAGE_MARKER_FILE) and consider the test exercised if that
file exists; if neither the grep finds the GLIBC/openshell marker nor the marker
file exists then call fail with the existing message, otherwise call pass (and
optionally adjust the pass message to mention the marker file when present).
Ensure you reference START_LOG, the grep check, the fail and pass calls when
making the change.
…x md lint

Address PR review findings:

1. Merge origin/main to pick up PR #3312 (isGatewayHttpReady from src/lib/onboard/gateway-http-readiness.ts). Replace our ad-hoc verifyDockerDriverGatewayListening (TCP probe) with a call to the shared isGatewayHttpReady helper. HTTP is strictly stronger than a raw TCP connect and is already the de-facto standard used by every other gateway-reuse decision site in onboard.ts after #3312. Docker-driver and K3s paths now converge on the same probe.
2. Drop the parallel test/gateway-tcp-probe.test.ts (9 tests for a helper that no longer exists). The shared helper's behavior is already covered by test/gateway-http-reuse-wait.test.ts (21 tests). Replace with test/gateway-health-honest-integration.test.ts (6 source-shape tests) guarding the #3111 integration pattern.
3. Fix 7 markdownlint errors in ACCEPTANCE.md (MD040, MD022, MD031). The 'checks' job flagged these; it now passes locally.
4. Update ACCEPTANCE.md to reflect the simpler design: Phase 1 is 'reuse existing helper' rather than 'add new TCP probe helper', and the refactoring alignment table records #3312 as 'adopted' rather than 'to be coordinated'.

Behavior of the fix is unchanged:
- child.once('exit') listener tracks zombies (detached children that process.kill(pid, 0) would falsely report as alive)
- poll loop guard is childExited || !isPidAlive(childPid)
- healthy log gated on isGatewayHttpReady() after isGatewayHealthy()
- user-visible exit code/signal surfaced on failure

Related: #3111
Coverage guard: #3362 (gateway-health-honest-e2e)
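The child-exit tracking listed in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code in `src/lib/onboard.ts`: the helper names `waitForEarlyExit` and `describeExit`, their parameters, and the `ExitInfo` shape are assumptions made for the example.

```typescript
import { spawn } from "node:child_process";

// Hypothetical shape: how the child ended, as reported by Node's 'exit' event.
interface ExitInfo {
  code: number | null;
  signal: string | null;
}

// Hypothetical helper: spawn a detached child and report whether it ended
// within graceMs. Resolves null if the child is still running at that point.
function waitForEarlyExit(
  command: string,
  args: string[],
  graceMs: number,
): Promise<ExitInfo | null> {
  return new Promise((resolve) => {
    const child = spawn(command, args, { detached: true, stdio: "ignore" });
    const timer = setTimeout(() => resolve(null), graceMs);
    // 'error' is emitted when the process could not be spawned at all.
    child.once("error", () => {
      clearTimeout(timer);
      resolve({ code: null, signal: null });
    });
    // 'exit' carries the real exit code or terminating signal.
    child.once("exit", (code, signal) => {
      clearTimeout(timer);
      resolve({ code, signal });
    });
  });
}

// Turn the exit details into a short, user-visible phrase.
function describeExit(info: ExitInfo): string {
  if (info.signal !== null) return `terminated by signal ${info.signal}`;
  if (info.code !== null) return `exited with code ${info.code}`;
  return "could not be started";
}
```

In a poll loop, a flag set by the `'exit'` listener can be consulted next to the pid probe, which is the role of the `childExited || !isPidAlive(childPid)` guard named above.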
…eporting healthy (#3111) (#3378)

## Summary

`nemoclaw onboard` was printing `✓ Docker-driver gateway is healthy` even when the openshell-gateway binary crashed immediately on startup, then failing the next step with `Connection refused`. The reported trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in the shipped binary, but the underlying NemoClaw bug is **platform-independent**: any reason the binary fails to start (missing shared lib, CDI spec error, port conflict, permissions, corrupted download) surfaces the same false-positive.

This PR fixes the false-positive at the caller site in `startDockerDriverGateway()`, without modifying the shared `isGatewayHealthy()` (which is pinned to pure-function status by the #2020 follow-up test).

Fixes #3111. Closes the gap covered by the failing E2E test added in #3362: `gateway-health-honest-e2e` flips from 🔴 red on `main` to 🟢 green on this branch.

## Root cause

`startDockerDriverGateway()` in `src/lib/onboard.ts` spawned the gateway binary with `spawn(..., { detached: true })` + `child.unref()`, then polled using two checks that could both lie:

1. **`isPidAlive(childPid)`** uses `process.kill(pid, 0)` which returns `true` for **zombies**. Since the parent Node process never `wait()`s on the detached child, crashed children linger as zombies and `isPidAlive` reports them as alive.
2. **`isGatewayHealthy(status, gwInfo, activeGwInfo)`** is a pure string match over openshell CLI output. `isGatewayConnected` in `src/lib/state/gateway.ts` matches on `"Server Status"`, the **table header** that `openshell status` always prints. On a crashed gateway, the header is still emitted and the body contains `× client error (Connect) tcp connect error Connection refused`, but `isGatewayConnected` returns true anyway.
Smoking gun from the red-on-main run [25698031380](https://github.com/NVIDIA/NemoClaw/actions/runs/25698031380):

```
[DIAG] openshell status:
Server Status
Gateway: nemoclaw
Server: http://127.0.0.1:8080
Error: × client error (Connect)
├─▶ tcp connect error
╰─▶ Connection refused (os error 111)
```

## Changes

**`src/lib/onboard.ts`**: three coordinated changes in `startDockerDriverGateway`:

1. **Track child-exit via the ChildProcess `'exit'` event**, not just `isPidAlive`. A `child.once('exit', ...)` listener flips a `childExited` flag that the poll loop consults alongside `isPidAlive`. This catches zombies that `isPidAlive` misses and also captures the exit code/signal for the failure message.
2. **Add `verifyDockerDriverGatewayListening(port, timeoutMs)`**, a TCP connect probe to `127.0.0.1:${GATEWAY_PORT}` using `node:net` with a socket timeout. Resolves boolean, never throws. This is the Docker-driver path equivalent of `verifyGatewayContainerRunning` (added for #2020 on the K3s path).
3. **Gate the "healthy" log on the TCP probe**: the poll loop now only logs `✓ Docker-driver gateway is healthy` after `isGatewayHealthy` **AND** a successful TCP connect. On probe failure the loop keeps polling, because the binary may still be binding its listener. The `childExited` check at the top of the loop terminates us if the process actually died.

Also improves the final failure message to include **how** the gateway exited (signal vs. exit code) so users don't have to `tail` the gateway log.

## What this PR does NOT change

`isGatewayHealthy` in `src/lib/state/gateway.ts` is left untouched. The #2020 follow-up test at `test/gateway-liveness-probe.test.ts:74` pins it to pure-function status ("no I/O, no docker, no spawn, no exec"). Fix at the caller, not the shared helper, which is the same pattern as #2020.
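A TCP connect probe of the kind described under Changes might look like the sketch below. It is an illustration under stated assumptions, not the `verifyDockerDriverGatewayListening` implementation: the names `probeTcpListening` and `demoProbe` are hypothetical, and the 50 ms floor is borrowed from the "enforces minimum 50 ms timeout" test description in this commit message.

```typescript
import { Socket, createServer } from "node:net";

// Hypothetical helper: resolves true only if something accepts a TCP
// connection on host:port within the timeout. It never rejects.
function probeTcpListening(
  port: number,
  timeoutMs: number,
  host: string = "127.0.0.1",
): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = new Socket();
    let settled = false;
    const finish = (ok: boolean): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.destroy();
      resolve(ok);
    };
    // Assumed floor of 50 ms so a tiny timeout cannot fire before connect.
    const timer = setTimeout(() => finish(false), Math.max(50, timeoutMs));
    socket.once("connect", () => finish(true));
    socket.once("error", () => finish(false)); // e.g. ECONNREFUSED
    socket.connect(port, host);
  });
}

// Usage sketch: probe a throwaway local server while it listens and again
// after it has been closed.
async function demoProbe(): Promise<[boolean, boolean]> {
  const server = createServer((conn) => conn.destroy());
  await new Promise<void>((resolve) => server.listen(0, "127.0.0.1", () => resolve()));
  const address = server.address();
  if (address === null || typeof address === "string") {
    throw new Error("expected a TCP address");
  }
  const whileListening = await probeTcpListening(address.port, 1000);
  await new Promise<void>((resolve) => server.close(() => resolve()));
  const afterClose = await probeTcpListening(address.port, 1000);
  return [whileListening, afterClose];
}
```

Resolving a boolean instead of rejecting keeps the caller's poll loop simple: a failed probe just means "keep polling", and the separate child-exit check decides when to give up.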
## Tests

- **New unit test:** `test/gateway-tcp-probe.test.ts` (9 tests)
  - `verifyDockerDriverGatewayListening` resolves true for listening ports
  - resolves false for closed ports (Connection refused)
  - resolves false on timeout (non-routable RFC 1918 host)
  - enforces minimum 50 ms timeout
  - never throws
  - **source-shape guards** (regexes over `src/lib/onboard.ts`): child-exit tracking present, `childExited || !isPidAlive` check at poll-loop top, TCP probe called before the "healthy" log, exit details surfaced in the failure message
- **E2E acceptance gate:** `gateway-health-honest-e2e` (from #3362): red on main, expected green on this branch. Will dispatch via nightly-e2e selective run once PR opens.
- **Existing tests:** full `vitest` suite passes (CLI); `test/gateway-state.test.ts` (47 tests) and `test/gateway-liveness-probe.test.ts` (7 tests) still green, with no behavior change in the shared helpers.

## Refactoring alignment

Noted in `ACCEPTANCE.md` (also in this branch):

- **#2562** (unified timeout abstraction): `TODO(#2562)` on the TCP probe helper's timeout logic, for mechanical adoption later.
- **#3213** (unified advisory registry): `TODO(#3213)` on the failure-message format so it can migrate to structured advisories later.
- **PR #3312** (laitingsheng, `isGatewayHttpReady` for K3s path): same pattern, different surface. Easy to converge once both land; can extract a shared "gateway liveness probe" module if desired.
- **PR #3306** (cv, `gateway-bootstrap.ts` extraction): doesn't touch `startDockerDriverGateway`; no conflict expected.

## Related

- Fixes #3111
- Coverage guard: #3362 (`gateway-health-honest-e2e`)
- Cross-reference patterns: #2020 (K3s path equivalent), #3312 (HTTP readiness for K3s reuse)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added a TCP readiness check during gateway startup to ensure the service is actually listening before reporting healthy.
* **Bug Fixes**
  * Reduced false “alive” reports for the gateway by detecting early child-process termination.
  * Improved startup failure messages to include how the gateway process terminated (signal vs exit code).
* **Tests**
  * Added integration and unit tests to validate startup health and TCP probe behavior.

[](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3378)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Summary

Moves the existing #3111 `gateway-health-honest-e2e` coverage guard out of scheduled `nightly-e2e.yaml` and into the new `regression-e2e.yaml` holding-pen workflow.

## Why

Failing-test-first guards and high-signal regression anchors should be easy to dispatch while fixes are in flight, but should not automatically keep scheduled nightly red. The new regression workflow gives us a place to keep these guards and periodically review/promote stable ones into nightly.

## What changed

- Removed `gateway-health-honest-e2e` from `nightly-e2e.yaml` job list and reporting dependencies.
- Added `regression-e2e.yaml` with `gateway-health-honest-e2e` as a manually dispatchable regression job.
- Left `test/e2e/test-gateway-health-honest.sh` unchanged.

## Validation

- YAML parse for `nightly-e2e.yaml`
- YAML parse for `regression-e2e.yaml`

Related: #3111
Related PR: #3362

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
  * Removed the automated nightly E2E run for gateway-health-honest and updated nightly job selection and downstream notifications accordingly
  * Added a manual regression E2E workflow for gateway-health-honest with on-demand job selection, gating logic, concurrency control, and failure-artifact upload
  * Updated local testing guidance to recommend using the new regression workflow for gateway-health-honest runs

[](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3411)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
Coverage guard for #3111 — "Docker-driver gateway is healthy" false-positive
This PR adds an E2E regression test that fails on `main` today. It is intentional that this test will be red on `main` and nightly will go red until the fix for #3111 lands.
The gap
Issue #3111 reports that on Ubuntu 22.04 the onboard flow prints `✓ Docker-driver gateway is healthy` while the gateway log shows the binary never actually ran (the GLIBC_2.38/2.39 "not found" startup error).
The underlying NemoClaw bug is platform-independent:
- `startGateway()` (`src/lib/onboard.ts:5500+`) spawns the gateway binary with `detached: true` and `child.unref()`. When the binary crashes, the detached child becomes a zombie, and `isPidAlive()` (`process.kill(pid, 0)`) returns true for zombies, so the poll loop doesn't break.
- `registerDockerDriverGatewayEndpoint()` (`src/lib/onboard.ts:4347`) is metadata-only: `openshell gateway add --local --name nemoclaw <url>` writes the endpoint to the config; it does NOT probe the endpoint.
- `isGatewayHealthy()` (`src/lib/state/gateway.ts:99`) is a string match on `openshell status` and `openshell gateway info` output, not a live health check.
Result: on any Linux host where the gateway binary fails to start for any reason (GLIBC mismatch, missing shared lib, permissions, OOM, CDI-spec error, corrupted binary…), onboard reports `✓ Docker-driver gateway is healthy` and proceeds to the next onboard step, which then fails with a confusing `Connection refused` downstream.

There is an existing `openshell-gateway-upgrade-e2e` test covering the stale-gateway-replaced path for PR #3001, but no test covers the gateway-binary-crashes path that is the root issue in #3111.
What this test does
`test/e2e/test-gateway-health-honest.sh`:
1. Installs the `openshell` + `openshell-gateway` binaries via `scripts/install-openshell.sh` (same setup path as the existing upgrade test).
2. Creates a shim at `$STATE_DIR/openshell-gateway-sabotage` that exits immediately with the same GLIBC-style stderr reported in #3111 ([Linux][Install] PR #3001 Docker-driver gateway requires Ubuntu 24.04+ (GLIBC 2.39) - not documented, fails silently on Ubuntu 22.04 with false "healthy" status).
3. Calls `startGateway(null)` via a Node heredoc, with `NEMOCLAW_OPENSHELL_GATEWAY_BIN` pointing at the shim.
4. Asserts that the log does NOT contain "Docker-driver gateway is healthy".
5. Asserts that a failure indicator is surfaced (`failed to start`, `crash`, `exit`, `not found`, or a thrown exception).
6. Asserts that no live non-zombie gateway process remains after the simulated crash.
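The log assertions above can be restated as a small pure function. This is a TypeScript paraphrase for illustration only, since the actual test is a shell script: `judgeLogs` and `LogVerdict` are hypothetical names, the marker regex is the one quoted in the review comments, and reading the markers from the gateway log follows the later commit in this thread that moved the check there.

```typescript
// Hypothetical verdict shape for the example.
interface LogVerdict {
  shimRan: boolean; // positive evidence that the sabotage shim was invoked
  claimedHealthy: boolean; // the false-positive string appeared
  honest: boolean; // shim ran AND onboard did not claim "healthy"
}

// startLog: node's stdout/stderr; gatewayLog: the gateway binary's own log.
function judgeLogs(startLog: string, gatewayLog: string): LogVerdict {
  const shimRan = /GLIBC_2\.3(8|9)|openshell-gateway-sabotage/.test(gatewayLog);
  const claimedHealthy = startLog.includes("Docker-driver gateway is healthy");
  return { shimRan, claimedHealthy, honest: shimRan && !claimedHealthy };
}

// Illustrative inputs (made up for the example, not captured from a run).
const buggy = judgeLogs("✓ Docker-driver gateway is healthy", "GLIBC_2.38 not found");
const fixed = judgeLogs("gateway failed to start", "GLIBC_2.38 not found");
const untested = judgeLogs("Cannot find module", "");
console.log(buggy.honest, fixed.honest, untested.honest);
```

Requiring `shimRan` before trusting the absence of "healthy" is what keeps an unrelated setup failure from passing the primary assertion by accident.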
The test runs on `ubuntu-latest` and does not require an Ubuntu 22.04 runner: it exercises the NemoClaw-side bug class, not the OpenShell-side GLIBC packaging choice. The GLIBC compatibility concern is an OpenShell team issue and is out of scope for this coverage guard.
Expected CI behavior
`gateway-health-honest-e2e` is red on `main` for as long as the log contains `✓ Docker-driver gateway is healthy`.
The red-nightly tradeoff
Once merged, the nightly badge will go red on `gateway-health-honest-e2e` until #3111 is fixed. That is the point: the failing test is the executable acceptance criterion for the fix. A subsequent PR authored via `/skill:nemoclaw-issue-kickoff 3111` will produce the fix with this test as its definition-of-done.
Expected failure output on main
Wiring
- `test/e2e/test-gateway-health-honest.sh` (~170 LOC, modeled after `test/e2e/test-openshell-gateway-upgrade.sh`)
- `gateway-health-honest-e2e` in `.github/workflows/nightly-e2e.yaml` (6 edits: comment, inputs.jobs description, new job block, 3 needs arrays)
References
- `test/e2e/test-openshell-gateway-upgrade.sh`
- `src/lib/onboard.ts`: `startGateway`, `startGatewayWithOptions`, `isPidAlive`, `registerDockerDriverGatewayEndpoint`
- `src/lib/state/gateway.ts`: `isGatewayHealthy`
Summary by CodeRabbit