Commit 32cab9d

test(e2e): add gateway-health-honest coverage guard for #3111 (#3362)
## Coverage guard for #3111: "Docker-driver gateway is healthy" false-positive

This PR adds an E2E regression test that fails on `main` today. It is intentional that **this test will be red on `main` and nightly will go red** until the fix for #3111 lands.

### The gap

Issue #3111 reports that on Ubuntu 22.04 the onboard flow prints:

```
  Starting OpenShell Docker-driver gateway...
  Gateway log: ~/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
  ✓ Docker-driver gateway is healthy     ← false positive
```

while the gateway log shows the binary never actually ran:

```
openshell-gateway: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found
openshell-gateway: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found
```

The underlying NemoClaw bug is **platform-independent**:

- `startGateway()` (`src/lib/onboard.ts:5500+`) spawns the gateway binary with `detached: true` and `child.unref()`. When the binary crashes, the detached child becomes a **zombie**, and `isPidAlive()` (`process.kill(pid, 0)`) returns true for zombies, so the poll loop doesn't break.
- `registerDockerDriverGatewayEndpoint()` (`src/lib/onboard.ts:4347`) is **metadata-only**: `openshell gateway add --local --name nemoclaw <url>` writes the endpoint to the config; it does NOT probe the endpoint.
- `isGatewayHealthy()` (`src/lib/state/gateway.ts:99`) is a **string match on `openshell status` and `openshell gateway info` output**, not a live health check.

Result: on any Linux host where the gateway binary fails to start for any reason (GLIBC mismatch, missing shared lib, permissions, OOM, CDI-spec error, corrupted binary…), onboard reports `✓ Docker-driver gateway is healthy` and proceeds to the next onboard step, which then fails with a confusing `Connection refused` downstream.

There is an existing `openshell-gateway-upgrade-e2e` test covering the **stale-gateway-replaced** path for PR #3001, but no test covers the **gateway-binary-crashes** path that is the root issue in #3111.

### What this test does

`test/e2e/test-gateway-health-honest.sh`:

1. Installs the real `openshell` + `openshell-gateway` binaries via `scripts/install-openshell.sh` (same setup path as the existing upgrade test).
2. Drops a sabotage shim at `$STATE_DIR/openshell-gateway-sabotage` that exits immediately with the same GLIBC-style stderr reported in #3111.
3. Invokes `startGateway(null)` via a Node heredoc, with `NEMOCLAW_OPENSHELL_GATEWAY_BIN` pointing at the shim.
4. Asserts:
   - **Primary:** the onboard output does NOT contain `"Docker-driver gateway is healthy"`.
   - **Corroborating:** the node process exits non-zero (≠ 0).
   - **Corroborating:** onboard surfaces a user-visible failure line (`failed to start`, a gateway `crash` / `exit` / `error` message, or a thrown exception). A generic `not found` is deliberately not accepted, because an unrelated module-not-found error would match it.
   - **Corroborating:** no live non-zombie gateway process remains after the simulated crash.

The test runs on `ubuntu-latest` and does not require an Ubuntu 22.04 runner: it exercises the NemoClaw-side bug class, not the OpenShell-side GLIBC packaging choice. The GLIBC compatibility concern is an OpenShell team issue and is out of scope for this coverage guard.

### Expected CI behavior

| Ref | Nightly job `gateway-health-honest-e2e` |
|---|---|
| **this PR** | PR checks don't run nightly, so this PR's CI is green (modulo the usual unit/lint suite). |
| **main (after merge)** | **FAILS.** Primary assertion trips: onboard logs `✓ Docker-driver gateway is healthy`. |
| **any branch that fixes #3111** | PASSES. |

### The red-nightly tradeoff

Once merged, the nightly badge will go red on `gateway-health-honest-e2e` until #3111 is fixed. That is the point: the failing test is the executable acceptance criterion for the fix. A subsequent PR authored via `/skill:nemoclaw-issue-kickoff 3111` will produce the fix with this test as its definition-of-done.

### Expected failure output on main

```
[FAIL] Onboard reported '✓ Docker-driver gateway is healthy' although the gateway binary crashed on startup (#3111 false-positive health check)
[DIAG] start log tail:
  Starting OpenShell Docker-driver gateway...
  Gateway log: /home/runner/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
  ✓ Docker-driver gateway is healthy
__onboard_startGateway_returned_successfully__
[DIAG] gateway log tail:
openshell-gateway-sabotage: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found
openshell-gateway-sabotage: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found
```

### Wiring

- New file: `test/e2e/test-gateway-health-honest.sh` (234 lines, modeled after `test/e2e/test-openshell-gateway-upgrade.sh`)
- New job: `gateway-health-honest-e2e` in `.github/workflows/nightly-e2e.yaml` (6 edits: comment, inputs.jobs description, new job block, 3 needs arrays)

### References

- Issue #3111 (NV QA / UAT / Platform: Brev / Platform: Ubuntu / Docker / bug)
- NVB#6150133
- PR #3001 (merged; introduced the affected code path)
- Existing test being modeled: `test/e2e/test-openshell-gateway-upgrade.sh`
- Relevant source:
  - `src/lib/onboard.ts`: `startGateway`, `startGatewayWithOptions`, `isPidAlive`, `registerDockerDriverGatewayEndpoint`
  - `src/lib/state/gateway.ts`: `isGatewayHealthy`

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Tests**
  * Added an end-to-end test that verifies gateway start failure is detected, exposes user-visible error output, and ensures no lingering gateway processes remain.
* **Chores**
  * Integrated the new test into the nightly pipeline as a selectable job, including artifact upload on failure and inclusion in failure reporting, PR comments, and the nightly scorecard.
* **Documentation**
  * Updated nightly workflow job selection documentation and added a review-tool entry linking the test to the nightly job.

[![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3362)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
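The zombie pitfall named under "The gap" can be illustrated with a short shell sketch. It is not the NemoClaw fix: `is_running_process` is a hypothetical helper, and it assumes a Linux host with `/proc` mounted. A bare signal-0 check (`kill -0` in shell, `process.kill(pid, 0)` in Node) succeeds for any PID still in the process table, including a zombie that has not been reaped; the helper additionally reads the process state and rejects `Z`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NemoClaw): succeed only when the PID
# exists AND is not a zombie. Assumes Linux with /proc available.
is_running_process() {
  local pid="$1" key value state=""
  [ -n "$pid" ] || return 1
  [ -r "/proc/$pid/status" ] || return 1
  # /proc/<pid>/status contains a line such as "State:  S (sleeping)".
  while read -r key value _; do
    if [ "$key" = "State:" ]; then
      state="$value"
      break
    fi
  done 2>/dev/null <"/proc/$pid/status"
  # Empty state: the process vanished. "Z": exited but not yet reaped.
  [ -n "$state" ] && [ "$state" != "Z" ]
}

# The current shell is a live, non-zombie process.
if is_running_process "$$"; then
  echo "pid $$ is running"
fi
```

A poll loop built on a check like this would stop treating a crashed, unreaped child as alive; whether the actual fix takes this route is up to the follow-up PR.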
1 parent 419f895 commit 32cab9d

3 files changed

Lines changed: 313 additions & 1 deletion

.coderabbit.yaml

Lines changed: 18 additions & 0 deletions
@@ -492,6 +492,24 @@ reviews:
         gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=openshell-gateway-upgrade-e2e
         ```
 
+    - path: "test/e2e/test-gateway-health-honest.sh"
+      instructions: |
+        This script is the coverage guard for #3111. It sabotages the
+        openshell-gateway binary with a shim that crashes on startup and
+        asserts onboard does NOT falsely log "Docker-driver gateway is
+        healthy" and does exit non-zero. Exercises the NemoClaw-side
+        false-positive (detached-zombie isPidAlive + metadata-only
+        isGatewayHealthy) regardless of OpenShell packaging choices.
+
+        **E2E test recommendation:**
+        - `gateway-health-honest-e2e` — false-positive health check when
+        the gateway binary crashes on startup (#3111)
+
+        To run selectively:
+        ```
+        gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=gateway-health-honest-e2e
+        ```
+
     - path: ".github/workflows/nightly-e2e.yaml"
       instructions: |
         This is the nightly E2E workflow definition. Changes here affect

.github/workflows/nightly-e2e.yaml

Lines changed: 61 additions & 1 deletion
@@ -22,6 +22,11 @@
 # openshell-gateway-upgrade-e2e
 # Validates stale Linux Docker-driver OpenShell gateway
 # processes are restarted after an OpenShell upgrade.
+# gateway-health-honest-e2e
+# Coverage guard for #3111: onboard must not log
+# "Docker-driver gateway is healthy" when the gateway
+# binary crashes on startup (detached-zombie + metadata-
+# only health check regression).
 # hermes-e2e Hermes Agent E2E — install → onboard --agent hermes → health
 # probe → live inference. Validates the multi-agent architecture.
 # hermes-inference-switch-e2e
@@ -67,7 +72,7 @@ on:
 messaging-compatible-endpoint-e2e,
 kimi-inference-compat-e2e,
 token-rotation-e2e, sandbox-survival-e2e,
-openshell-gateway-upgrade-e2e,
+openshell-gateway-upgrade-e2e, gateway-health-honest-e2e,
 issue-2478-crash-loop-recovery-e2e, hermes-e2e,
 hermes-inference-switch-e2e, hermes-discord-e2e,
 hermes-slack-e2e, sandbox-operations-e2e, inference-routing-e2e,
@@ -1290,6 +1295,58 @@ jobs:
             /tmp/nemoclaw-e2e-openshell-gateway-process.log
           if-no-files-found: ignore
 
+  # ── Gateway health-honesty E2E ──────────────────────────────
+  # Coverage guard for #3111. Issue #3111 reported that onboard prints
+  # "✓ Docker-driver gateway is healthy" on Ubuntu 22.04 even though the
+  # shipped openshell-gateway binary (GNU-linked against GLIBC 2.38/2.39)
+  # crashes immediately on a 22.04 host (GLIBC 2.35).
+  #
+  # Root cause is platform-independent: the detached child remains a
+  # zombie so isPidAlive() returns true, registerDockerDriverGatewayEndpoint()
+  # writes metadata without any TCP probe, and isGatewayHealthy() is a
+  # string match on openshell CLI output rather than a real health check.
+  # Any scenario where the gateway binary fails before serving connections
+  # will surface the same false-positive log on ANY Linux host — not just
+  # Ubuntu 22.04.
+  #
+  # This test sabotages the gateway binary with a shim that matches the
+  # #3111 failure mode (immediate exit with GLIBC-style stderr) and asserts
+  # that onboard does NOT log "healthy" and exits non-zero.
+  gateway-health-honest-e2e:
+    if: >-
+      github.repository == 'NVIDIA/NemoClaw' &&
+      (github.event_name != 'workflow_dispatch' ||
+      inputs.jobs == '' ||
+      contains(format(',{0},', inputs.jobs), ',gateway-health-honest-e2e,'))
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v6
+
+      - name: Setup Node
+        uses: actions/setup-node@v6
+        with:
+          node-version: "22"
+
+      - name: Run gateway health-honesty E2E test
+        env:
+          GITHUB_TOKEN: ${{ github.token }}
+          NEMOCLAW_NON_INTERACTIVE: "1"
+          NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE: "1"
+        run: bash test/e2e/test-gateway-health-honest.sh
+
+      - name: Upload gateway health-honesty logs on failure
+        if: failure()
+        uses: actions/upload-artifact@v4
+        with:
+          name: gateway-health-honest-logs
+          path: |
+            /tmp/nemoclaw-e2e-gateway-health-honest.log
+            /tmp/nemoclaw-e2e-gateway-health-honest-start.log
+            /tmp/nemoclaw-e2e-gateway-health-honest-process.log
+          if-no-files-found: ignore
+
   # ── Hermes rebuild upgrade E2E ──────────────────────────────
   # Same upgrade scenario as OpenClaw but for Hermes Agent.
   rebuild-hermes-e2e:
@@ -1861,6 +1918,7 @@ jobs:
 rebuild-openclaw-e2e,
 upgrade-stale-sandbox-e2e,
 openshell-gateway-upgrade-e2e,
+gateway-health-honest-e2e,
 rebuild-hermes-e2e,
 rebuild-hermes-stale-base-e2e,
 double-onboard-e2e,
@@ -1952,6 +2010,7 @@ jobs:
 rebuild-openclaw-e2e,
 upgrade-stale-sandbox-e2e,
 openshell-gateway-upgrade-e2e,
+gateway-health-honest-e2e,
 rebuild-hermes-e2e,
 rebuild-hermes-stale-base-e2e,
 double-onboard-e2e,
@@ -2091,6 +2150,7 @@ jobs:
 rebuild-openclaw-e2e,
 upgrade-stale-sandbox-e2e,
 openshell-gateway-upgrade-e2e,
+gateway-health-honest-e2e,
 rebuild-hermes-e2e,
 rebuild-hermes-stale-base-e2e,
 double-onboard-e2e,
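
The `if:` condition on the new job selects it by wrapping `inputs.jobs` in commas and searching for `,gateway-health-honest-e2e,`, so only an exact list entry matches and a longer or shorter job name does not. The sketch below restates that idea in shell for illustration only; `job_selected` is a hypothetical helper, not something in the repository. Like the expression as written, it assumes the list has no spaces around the commas. Unlike GitHub's `contains`, which ignores case, the shell `case` match here is case-sensitive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if NAME is an exact entry of the
# comma-separated list JOBS, or if JOBS is empty (meaning "run everything").
job_selected() {
  local jobs="$1" name="$2"
  [ -z "$jobs" ] && return 0
  # Wrapping both sides in commas turns a substring search into an
  # exact-entry match: ",a,b," contains ",b," but not ",x-b,".
  case ",${jobs}," in
    *",${name},"*) return 0 ;;
    *) return 1 ;;
  esac
}

if job_selected "openshell-gateway-upgrade-e2e,gateway-health-honest-e2e" "gateway-health-honest-e2e"; then
  echo "job would run"
fi
```

Without the comma wrapping, a request for `openshell-gateway-upgrade-e2e` would also match a hypothetical job named `gateway-upgrade-e2e`, since one name is a substring of the other.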
test/e2e/test-gateway-health-honest.sh

Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Coverage guard for issue #3111 — "Docker-driver gateway is healthy"
+# must not be logged when the gateway binary failed to start.
+#
+# Background: PR #3001 introduced a Linux Docker-driver gateway managed by
+# onboard.ts:startGateway(). On Ubuntu 22.04, the shipped openshell-gateway
+# binary is linked against GLIBC 2.38/2.39 and crashes immediately on a
+# 22.04 host (GLIBC 2.35). NemoClaw still reports "✓ Docker-driver gateway
+# is healthy" because:
+#   - the detached child becomes a zombie, so isPidAlive(childPid) returns
+#     true (the pid remains in the process table until the parent reaps it);
+#   - registerDockerDriverGatewayEndpoint() is metadata-only (openshell
+#     gateway add --local) and succeeds without any TCP probe;
+#   - isGatewayHealthy() reads openshell status / gateway info strings,
+#     not a live health probe — so cached / metadata-only output satisfies
+#     the check.
+#
+# This test is platform-independent: instead of exercising the GLIBC path
+# (which requires a 22.04 runner we don't have in CI) it substitutes the
+# gateway binary with a shim that crashes immediately with the same
+# GLIBC-style error on stderr. Any onboard that treats a crashed child as
+# healthy fails this test. The fix for #3111 must make startGateway verify
+# the child is actually alive (not a zombie) and that the endpoint serves
+# a real TCP probe before declaring "healthy".
+#
+# Expected result on main (bug present): FAIL — the test asserts onboard
+# must NOT print "Docker-driver gateway is healthy" when the binary
+# crashed; current code does print it, so the assertion fails.
+# Expected result after fix: PASS — onboard surfaces the crash and exits
+# non-zero.
+#
+# Related: #3111, PR #3001
+
+set -euo pipefail
+
+LOG_FILE="/tmp/nemoclaw-e2e-gateway-health-honest.log"
+START_LOG="/tmp/nemoclaw-e2e-gateway-health-honest-start.log"
+GATEWAY_LOG="/tmp/nemoclaw-e2e-gateway-health-honest-process.log"
+exec > >(tee "$LOG_FILE") 2>&1
+
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m'
+
+pass() { echo -e "${GREEN}[PASS]${NC} $1"; }
+info() { echo -e "${YELLOW}[INFO]${NC} $1"; }
+diag() { echo -e "${YELLOW}[DIAG]${NC} $1"; }
+fail() {
+  echo -e "${RED}[FAIL]${NC} $1" >&2
+  diag "start log tail:"
+  tail -80 "$START_LOG" 2>/dev/null || true
+  diag "gateway process log tail:"
+  tail -80 "$GATEWAY_LOG" 2>/dev/null || true
+  diag "onboard gateway log tail (where sabotage stderr lands):"
+  tail -80 "${STATE_DIR}/openshell-gateway.log" 2>/dev/null || true
+  diag "openshell status: $(openshell status 2>&1 || true)"
+  diag "gateway info: $(openshell gateway info -g nemoclaw 2>&1 || true)"
+  diag "pid file: $(cat "${PID_FILE:-/dev/null}" 2>/dev/null || echo missing)"
+  exit 1
+}
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
+STATE_DIR="${NEMOCLAW_OPENSHELL_GATEWAY_STATE_DIR:-$HOME/.local/state/nemoclaw/openshell-docker-gateway}"
+PID_FILE="${STATE_DIR}/openshell-gateway.pid"
+SABOTAGE_BIN="${STATE_DIR}/openshell-gateway-sabotage"
+CHILD_PID=""
+
+load_shell_path() {
+  if [ -f "$HOME/.bashrc" ]; then
+    # shellcheck source=/dev/null
+    source "$HOME/.bashrc" 2>/dev/null || true
+  fi
+  export NVM_DIR="${NVM_DIR:-$HOME/.nvm}"
+  if [ -s "$NVM_DIR/nvm.sh" ]; then
+    # shellcheck source=/dev/null
+    . "$NVM_DIR/nvm.sh"
+  fi
+  if [ -d "$HOME/.local/bin" ] && [[ ":$PATH:" != *":$HOME/.local/bin:"* ]]; then
+    export PATH="$HOME/.local/bin:$PATH"
+  fi
+}
+
+cleanup_pid() {
+  local pid="$1"
+  [ -n "$pid" ] || return 0
+  if kill -0 "$pid" 2>/dev/null; then
+    kill "$pid" 2>/dev/null || true
+    sleep 1
+    kill -9 "$pid" 2>/dev/null || true
+  fi
+  # Reap any zombies left over by the test
+  wait "$pid" 2>/dev/null || true
+}
+
+cleanup() {
+  set +e
+  if [ -f "$PID_FILE" ]; then
+    CHILD_PID="$(tr -d '[:space:]' <"$PID_FILE")"
+  fi
+  cleanup_pid "$CHILD_PID"
+  openshell gateway remove nemoclaw >/dev/null 2>&1 || true
+  rm -f "$PID_FILE" "$SABOTAGE_BIN"
+}
+trap cleanup EXIT
+
+cd "$REPO_ROOT"
+load_shell_path
+
+info "Preparing CLI build and OpenShell binaries"
+if [ ! -d node_modules ]; then
+  npm ci --ignore-scripts
+fi
+npm run build:cli
+bash scripts/install-openshell.sh
+load_shell_path
+
+command -v openshell >/dev/null 2>&1 || fail "openshell not found after install"
+command -v openshell-gateway >/dev/null 2>&1 || fail "openshell-gateway not found after install"
+
+# Start from a clean slate: no prior gateway metadata, no pid file.
+mkdir -p "$STATE_DIR"
+chmod 700 "$STATE_DIR"
+rm -f "$PID_FILE" "$START_LOG" "$GATEWAY_LOG"
+openshell gateway remove nemoclaw >/dev/null 2>&1 || true
+
+info "Installing sabotage gateway binary that simulates the #3111 GLIBC crash"
+cat >"$SABOTAGE_BIN" <<'SHIM'
+#!/usr/bin/env bash
+# Simulates the Ubuntu 22.04 GLIBC-2.38/2.39 failure mode reported in #3111.
+# The real binary dies at the dynamic-linker stage before main() runs; we
+# mirror that by emitting the same stderr fragment and exiting non-zero
+# before opening any TCP port.
+printf '%s\n' "$(basename "$0"): /lib/x86_64-linux-gnu/libc.so.6: version \`GLIBC_2.38' not found (required by $(basename "$0"))" >&2
+printf '%s\n' "$(basename "$0"): /lib/x86_64-linux-gnu/libc.so.6: version \`GLIBC_2.39' not found (required by $(basename "$0"))" >&2
+exit 127
+SHIM
+chmod 755 "$SABOTAGE_BIN"
+
+info "Invoking startGateway() with the sabotaged binary"
+# startGateway() with exitOnFailure:true calls process.exit(1) when it
+# concludes the gateway failed. A correctly-behaved onboard MUST either:
+#   (a) exit non-zero, OR
+#   (b) print "failed to start" / a surface error message,
+# and MUST NOT print "Docker-driver gateway is healthy".
+set +e
+NEMOCLAW_OPENSHELL_GATEWAY_BIN="$SABOTAGE_BIN" \
+  NEMOCLAW_HEALTH_POLL_COUNT="${NEMOCLAW_HEALTH_POLL_COUNT:-10}" \
+  NEMOCLAW_HEALTH_POLL_INTERVAL="${NEMOCLAW_HEALTH_POLL_INTERVAL:-1}" \
+  node <<'NODE' 2>&1 | tee "$START_LOG"
+const { startGateway } = require("./dist/lib/onboard");
+
+startGateway(null)
+  .then(() => {
+    console.log("__onboard_startGateway_returned_successfully__");
+    process.exit(0);
+  })
+  .catch((error) => {
+    console.error("__onboard_startGateway_threw__");
+    console.error(error && error.stack ? error.stack : error);
+    process.exit(3);
+  });
+NODE
+NODE_EXIT=$?
+set -e
+
+info "node exit code: ${NODE_EXIT}"
+
+# ── Pre-assertion: prove the sabotage path was actually exercised ───
+# Without this guard, an unrelated setup failure (module-not-found,
+# missing env, stale dist/, etc.) could produce a $START_LOG that
+# happens to lack the 'healthy' string and thereby false-green the
+# primary assertion. We require positive evidence that the sabotage
+# shim ran.
+#
+# The sabotage shim writes its GLIBC-style stderr to the gateway log
+# file opened by onboard.ts:startGatewayWithOptions at
+# $STATE_DIR/openshell-gateway.log (NOT to the start log, which only
+# captures node's stdout/stderr). That gateway log is the authoritative
+# source of truth for "did our binary get exec'd".
+GATEWAY_ONBOARD_LOG="${STATE_DIR}/openshell-gateway.log"
+if ! grep -qE 'GLIBC_2\.3(8|9)|openshell-gateway-sabotage' "$GATEWAY_ONBOARD_LOG" 2>/dev/null; then
+  fail "Sabotage markers (GLIBC_2.38/2.39 or 'openshell-gateway-sabotage') not observed in gateway log ${GATEWAY_ONBOARD_LOG} — the test may have failed before the sabotaged gateway was invoked, so the assertions below cannot be trusted. Inspect $START_LOG and $GATEWAY_ONBOARD_LOG above for the real cause."
+fi
+pass "Sabotage shim was invoked as expected (GLIBC/sabotage markers present in gateway log)"
+
+# ── Primary assertion ────────────────────────────────────────────────
+# This is the bug from #3111. Onboard printed "healthy" while the child
+# process was a crashed zombie and had never served a real connection.
+if grep -q "✓ Docker-driver gateway is healthy" "$START_LOG" \
+  || grep -q "Docker-driver gateway is healthy" "$START_LOG"; then
+  fail "Onboard reported '✓ Docker-driver gateway is healthy' although the gateway binary crashed on startup (#3111 false-positive health check)"
+fi
+pass "Onboard did not falsely log 'Docker-driver gateway is healthy' when the binary crashed"
+
+# ── Corroborating assertion 1: non-zero exit ─────────────────────────
+# startGateway(null) uses exitOnFailure:true → the node process MUST exit
+# non-zero when the gateway truly failed to start. Exit 0 means onboard
+# silently accepted the crashed gateway as success.
+if [ "$NODE_EXIT" -eq 0 ] || grep -q "__onboard_startGateway_returned_successfully__" "$START_LOG"; then
+  fail "startGateway() resolved successfully despite a crashed binary — onboard would have proceeded to inference setup against a dead gateway"
+fi
+pass "startGateway() did not resolve successfully with a crashed binary (node exit=${NODE_EXIT})"
+
+# ── Corroborating assertion 2: user-visible failure surfaced ─────────
+# Deliberately narrow: excludes generic 'not found' because an unrelated
+# module-not-found (e.g. stale dist/) would satisfy the match without
+# proving the gateway-failure code path was exercised. The Pre-assertion
+# above already proves the sabotage ran, but this stays narrow anyway.
+if ! grep -qiE "failed to start|gateway.*(crash|exit|error)|__onboard_startGateway_threw__" "$START_LOG"; then
+  fail "Onboard did not surface any gateway failure indicator to the user"
+fi
+pass "Onboard surfaced a user-visible gateway failure message"
+
+# ── Corroborating assertion 3: no live gateway process ───────────────
+if [ -f "$PID_FILE" ]; then
+  LINGERING_PID="$(tr -d '[:space:]' <"$PID_FILE")"
+  if [ -n "$LINGERING_PID" ] && kill -0 "$LINGERING_PID" 2>/dev/null; then
+    # A live pid that is *not* a zombie would mean onboard somehow kept
+    # something alive. Zombies are acceptable as a transient artifact.
+    STATE="$(ps -p "$LINGERING_PID" -o state= 2>/dev/null | tr -d ' ')"
+    if [ "$STATE" != "Z" ] && [ -n "$STATE" ]; then
+      fail "A non-zombie gateway pid (${LINGERING_PID}, state=${STATE}) is still alive after a simulated crash"
+    fi
+  fi
+fi
+pass "No live (non-zombie) gateway process is running after the simulated crash"
+
+echo ""
+pass "#3111 coverage guard green: onboard correctly surfaces a crashed gateway"
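
The script's header comment says the fix must confirm that the endpoint answers a real TCP probe before reporting "healthy". As a rough illustration of such a probe, and not the NemoClaw implementation, the sketch below uses bash's `/dev/tcp` redirection together with coreutils `timeout`. `tcp_probe`, the host, and the port are placeholders chosen for the example.

```shell
#!/usr/bin/env bash
# Hypothetical probe: succeeds only if a TCP connection to HOST:PORT can be
# opened within TIMEOUT seconds. Relies on bash's /dev/tcp pseudo-device.
tcp_probe() {
  local host="$1" port="$2" timeout_s="${3:-2}"
  # The child shell opens the socket on fd 3 and exits; a refused or
  # timed-out connection makes the command return non-zero.
  timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Placeholder endpoint; substitute the gateway's real host and port.
if tcp_probe 127.0.0.1 1 1; then
  echo "endpoint accepted a TCP connection"
else
  echo "endpoint is not accepting connections"
fi
```

A successful connect only shows that something is listening; a stricter check would also validate the response, which is beyond this sketch.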

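The script reads `$?` right after `node ... | tee "$START_LOG"`. Because `set -euo pipefail` leaves `pipefail` on, `$?` reflects node's failure as long as `tee` itself succeeds. The sketch below shows a more explicit alternative under stated assumptions: `run_and_capture` and `failing_step` are hypothetical names, and `failing_step` stands in for the node invocation. It returns the status of the first pipeline stage via `PIPESTATUS`, regardless of `pipefail`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command piped through tee and return the
# command's own exit status rather than tee's.
run_and_capture() {
  "$@" | tee /dev/null
  # PIPESTATUS must be read immediately; the next command overwrites it.
  return "${PIPESTATUS[0]}"
}

# Stand-in for the node step: prints a line and exits 3.
failing_step() {
  echo "simulated failure"
  return 3
}

if run_and_capture failing_step; then
  echo "step succeeded"
else
  echo "step failed with status $?"
fi
```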