Skip to content

Commit 98dbba4

Browse files
JohnMcLearclaude
andauthored
test(ci): heartbeat + running-test pointer to debug the silent backend-test ELIFECYCLE (#7838)
* test(ci): heartbeat + running-test pointer in backend test diagnostics Backend tests silently die with code 255 mid-suite ~22% of the time on develop (most often Windows-with-plugins, Node 24). Each kill lands 300±50 ms after the previous test's clean ✔ teardown line and produces no failing-test marker, no error, no Mocha summary, and — despite the unconditional handlers in `diagnostics.ts` — none of the JS-level death events fire either. Recent example: run 26311025244 (`Windows with Plugins (24)`); both attempts crashed at completely different "last test" locations, so the dying test itself isn't to blame. The existing diagnostics only set lastSeenTest in afterEach, so if the kill lands during the NEXT test's setup or body — which is exactly the ~300ms gap we observe — the pointer reads as the previous (passing) test. That hides whether we're between tests or inside one, and which one. Two changes: 1. Track currentTest in beforeEach as well as lastFinishedTest in afterEach. Every diag line now carries both, so the death point is bracketable regardless of which lifecycle phase the kill interrupts. 2. Add a 1Hz heartbeat that writeSyncs the running-test name plus `process.memoryUsage()` (rss, heap) and the active-handle and active-request counts. The interval is unref'd so it never holds the event loop open by itself. Cost is roughly one extra log line per second of mocha runtime (~60-120 lines per CI run). When the next failure fires, the last heartbeat narrows the kill window to ≤1s, the running pointer names the test on the rails at that moment, and the handle/memory trace gives a sparkline that exposes sudden spikes — a leaked socket, an unref'd timer, a runaway map — that would otherwise be invisible at the runner-log level. No behavior change on successful runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: heartbeat _getActiveHandles optional chain bug (Qodo #2) Qodo correctly flagged `_getActiveHandles?.().length` as a latent TypeError: `?.()` guards the call but the call's `undefined` return on a missing method still hits `.length`, which throws. Since the heartbeat fires on a setInterval inside the mocha bootstrap, a Node build without the underscore-prefixed internals would take down the whole backend test run. Capture the array first, then read `.length` only when it actually exists. -1 stays as the "API missing" sentinel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(ci): per-test start diag + drop stray console.log noise Follow-up to the heartbeat PR after run 26397693748 confirmed the diagnostic works (the kill landed at importexportGetPost.ts 'Import authorization checks > authn anonymous !exist -> fail', ~300 ms after the previous test's ✔). Two cleanups so the next failure pinpoints faster and reads cleaner: 1. diagnostics.ts: emit a `test start: <name>` diag line in the mocha beforeEach hook, after setting the currentTest pointer. The 1Hz heartbeat misses tests that take less than a second, and the silent kills land ~300 ms after a test boundary — precisely the gap where heartbeat resolution fails. A start line per test gives sub-millisecond resolution on which test was on the rails when the process died. 2. specs/api/importexportGetPost.ts: drop a stray `console.log(importedPads)` debug leftover (and the duplicate `await importEtherpad(records)` only present to feed it) in the `malformed .etherpad files are rejected` block. The leftover dumped a ~600-line reflection of a supertest Response object to the CI log on every successful run, drowning the surrounding test output and making the silent-kill window much harder to read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(ci): write node-report on every heartbeat tick Run 26398054688 narrowed the kill to a specific test (pad.ts > Gets text on a pad Id and doesn't have an excess newline) but the test body is a trivial supertest GET — the kill bypasses all JS handlers, so we can't capture stack state at death. Two failures across two runs share the shape: an agent.{get,post} + common.generateJWTToken() call dies ~300-600 ms after test start, with no JS-visible cause. The next step is V8 + native stack. Hook into the existing 1Hz heartbeat to call process.report.writeReport(path) whenever a report directory is set. The Windows backend-tests workflow already wires up `--report-directory=${{ github.workspace }}/node-report` via NODE_OPTIONS and uploads that directory as an artifact on failure, so the rolling snapshots ride for free on the existing upload step. Each report (~50 KB) contains: - V8 + native call stacks for all threads - libuv active handles (open TCP, timers, file handles) - JS heap statistics - resourceUsage + system info - shared-object list On the next reproduction the latest report before ELIFECYCLE will sit ~0-1 s before the kill — enough to see whether the V8 stack is inside jose's WebCrypto sign path, inside supertest's TCP roundtrip, or somewhere unexpected entirely. NODE_REPORT_DIR is also honored as an explicit override for local repro / non-workflow runs. Cost: ~6 files (~300 KB) per Windows backend-test failure, plus ~50 ms event-loop pause per heartbeat. No-op when neither env var is set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: writeReport with bare filename, not mixed-slash absolute path Run 26398830249 exposed the path-separator bug in the previous commit: every heartbeat tick on the Windows runner logged Failed to open Node.js report file: D:\a\etherpad\etherpad/node-report/hb-NNNN-...json directory: D:\a\etherpad\etherpad/node-report (errno: 22) — EINVAL. The workflow sets --report-directory with forward-slash separators on Windows, then this code concatenated another `/` plus the filename, producing a path Node's report writer rejects. writeReport(fileName) takes a BARE filename and resolves it against the configured report directory using the platform-correct separator internally. Switch to that. For local repro overrides via NODE_REPORT_DIR, push the path into process.report.directory (the documented config knob) instead of joining it into the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(ci): write node-report on test boundaries, throttled to 4Hz Run 26398985832 proved the heartbeat-only report cadence isn't tight enough: the last report before the kill was hb-0013 at +16201ms, ~1.5 s before ELIFECYCLE at +17701ms — during which ~30 tests fired, including the dying one (`authn anonymous !exist -> fail`). The captured V8 stack is just our heartbeat code, not the dying test. Move the writeReport call to a shared tryWriteReport() helper and invoke it from BOTH the heartbeat AND mocha's beforeEach hook, throttled to one report per 250 ms. That gives ≤250 ms resolution on the kill window — close enough that the latest report captures state from inside the dying test rather than from the test ~30 slots earlier. The heartbeat always writes (so we don't lose the no-test-running ticks during setup); beforeEach only writes when the throttle window has elapsed. Cost ceiling: ~4 reports/sec × ~12 s test phase ≈ 48 reports (~2.5 MB) per failing run. Each writeReport adds ~50 ms of event-loop pause — at 4Hz that's 20% of wall time spent in diagnostics, which is acceptable for a temporary debug-only bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(ci): drop beforeEach report throttle from 250ms to 100ms Run 26399285213's rerun captured a sixth death point on the new 4Hz cadence (`socketio.ts > Duplicate-author handling > cookie identity: same-author second socket kicks the first`, kill at +45953ms, 271ms after test start). The throttle suppressed the dying test's own beforeEach: previous boundary write landed 128 ms earlier and the next 31 ms after that, both inside the 250 ms window. Last captured report (be-0100) is from the previous test. 100 ms is still well above the inter-test cadence in fast burst suites (tests fire 2-5 ms apart, so 20-50 of them get throttled to a single write, ceiling ~10 writes/sec). But it's tight enough that any death-window neighbour ≥100 ms after the previous report — the shape we keep observing — gets its own boundary snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d9dabe3 commit 98dbba4

2 files changed

Lines changed: 127 additions & 14 deletions

File tree

src/tests/backend/diagnostics.ts

Lines changed: 127 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,14 @@
1010
// in a way that bypassed JS handlers — SIGKILL, OOM, or a fatal native
1111
// error — OR mocha itself called process.exit before the handlers ran.
1212
//
13+
// Subsequent runs on develop (e.g. 26311025244, Windows with plugins,
14+
// Node 24) reproduced the same fingerprint: `[diag +0ms] diagnostics
15+
// loaded` lands but no other diag line appears before the silent
16+
// ELIFECYCLE. Death lands 300±50 ms after the previous test's teardown
17+
// and the lastSeenTest pointer (only updated in afterEach) tells us
18+
// nothing about whether the next test had started or whether mocha was
19+
// between tests at the kill moment.
20+
//
1321
// This file:
1422
// 1. Registers handlers UNCONDITIONALLY at mocha startup (common.ts is
1523
// only imported by ~27 of 47 specs, so its handlers may register
@@ -18,9 +26,17 @@
1826
// complete before the kernel returns from the syscall, so the line
1927
// lands in the runner log even if the process is killed
2028
// milliseconds later.
21-
// 3. Tracks the last-seen test via a mocha root afterEach hook so the
22-
// death point is identified.
23-
// 4. Logs exit-related events so we can discriminate:
29+
// 3. Tracks BOTH the currently-running test (set in beforeEach) and the
30+
// last-finished test (set in afterEach), so the death point can be
31+
// bracketed even when the kill lands inside the next test's setup or
32+
// body, before its own afterEach has a chance to update the pointer.
33+
// 4. Emits a 1Hz heartbeat carrying the running-test name, RSS / heap
34+
// usage, and active handle / request counts. If the process dies
35+
// without firing any JS handler, the last heartbeat narrows the
36+
// kill window to <=1s and the handle-count trace exposes leaks
37+
// (sockets, timers, native bindings) that would otherwise be
38+
// invisible at the runner-log level.
39+
// 5. Logs exit-related events so we can discriminate:
2440
// beforeExit + exit -> clean event-loop drain (Linux CI, local)
2541
// only exit -> process.exit() called — expected when mocha
2642
// is launched with --exit (the Windows CI
@@ -35,7 +51,8 @@
3551
import {writeSync} from 'node:fs';
3652

3753
const t0 = Date.now();
38-
let lastSeenTest = '<no test seen yet>';
54+
let currentTest = '<no test running>';
55+
let lastFinishedTest = '<no test finished yet>';
3956

4057
const diag = (msg: string): void => {
4158
const line = `[diag +${Date.now() - t0}ms] ${msg}\n`;
@@ -48,18 +65,92 @@ const diag = (msg: string): void => {
4865

4966
diag('diagnostics loaded');
5067

68+
// Heartbeat. unref()'d so it never holds the event loop open by itself —
69+
// it only fires if mocha is otherwise alive. The interval cadence (1Hz) is
70+
// the trade-off between log noise (~60-120 extra lines per run) and how
71+
// tightly we can bracket the kill timestamp.
72+
//
73+
// When the backend-test workflow has `--report-directory` set (only the
74+
// Windows jobs do at time of writing), every heartbeat also writes a Node
75+
// diagnostic report into that directory. The previous two failing CI runs
76+
// proved the kill bypasses all JS handlers (uncaughtException, signal,
77+
// beforeExit — none fire), so we can't capture stack state at the moment
78+
// of death. The next-best thing is a rolling 1Hz snapshot of:
79+
// - V8 / native call stacks (all threads)
80+
// - libuv active handles (open TCP connections, timers, file handles)
81+
// - JS heap statistics
82+
// - System info (CPU, memory, environment)
83+
// On the next failure the workflow uploads node-report/ as an artifact,
84+
// and the latest report before the kill bracket gives us 0-1s of pre-death
85+
// state — including, critically, whether the V8 stack is inside jose's
86+
// JWT signing path, supertest's TCP roundtrip, or somewhere else.
87+
// Honor NODE_REPORT_DIR as a local-repro override by pushing it into
88+
// process.report.directory, which is the documented config knob. We can NOT
89+
// pass an absolute path into writeReport(): on Windows the runner sets
90+
// `--report-directory=D:\a\etherpad\etherpad/node-report` (mixed slashes),
91+
// and Node's report writer rejects any subsequent absolute path with errno
92+
// 22 / EINVAL. Pass a bare filename and let Node concatenate it against the
93+
// configured directory using its own platform-correct separator.
94+
if (process.env.NODE_REPORT_DIR && (process as any).report) {
95+
(process as any).report.directory = process.env.NODE_REPORT_DIR;
96+
}
97+
const canWriteReport =
98+
typeof (process as any).report?.writeReport === 'function'
99+
&& !!((process as any).report?.directory
100+
|| (process.env.NODE_OPTIONS || '').includes('--report-directory='));
101+
let reportCounter = 0;
102+
let lastReportT = 0;
103+
104+
// Shared writer used by both the heartbeat tick and the beforeEach hook.
105+
// Throttled by minGapMs so a burst of fast tests doesn't produce hundreds of
106+
// reports — we just need dense enough coverage to bracket the kill.
107+
const tryWriteReport = (prefix: string, minGapMs: number): void => {
108+
if (!canWriteReport) return;
109+
const now = Date.now();
110+
if (now - lastReportT < minGapMs) return;
111+
lastReportT = now;
112+
reportCounter += 1;
113+
const safeTest = currentTest
114+
.replace(/[^a-zA-Z0-9._-]+/g, '_')
115+
.slice(0, 80);
116+
const name = `${prefix}-${String(reportCounter).padStart(4, '0')}-${safeTest}.json`;
117+
try {
118+
// Bare filename only — see comment at canWriteReport definition above.
119+
(process as any).report.writeReport(name);
120+
} catch { /* swallow — diagnostics must not throw */ }
121+
};
122+
const heartbeat = setInterval(() => {
123+
const mem = process.memoryUsage();
124+
// _getActiveHandles / _getActiveRequests are undocumented Node internals.
125+
// The earlier shape `_getActiveHandles?.().length ?? -1` was a bug: `?.()`
126+
// only guards the call, so a missing method returns `undefined` and then
127+
// `.length` throws TypeError — which would take down the whole test run.
128+
// Capture the array first, then read .length only when it actually exists.
129+
const handlesArr = (process as any)._getActiveHandles?.();
130+
const handles = handlesArr ? handlesArr.length : -1;
131+
const requestsArr = (process as any)._getActiveRequests?.();
132+
const requests = requestsArr ? requestsArr.length : -1;
133+
diag(`hb running="${currentTest}" lastFinished="${lastFinishedTest}" ` +
134+
`rss=${Math.round(mem.rss / 1024 / 1024)}M ` +
135+
`heap=${Math.round(mem.heapUsed / 1024 / 1024)}M ` +
136+
`handles=${handles} requests=${requests}`);
137+
// Heartbeat always writes — its 1Hz cadence is the floor.
138+
tryWriteReport('hb', 0);
139+
}, 1000);
140+
heartbeat.unref();
141+
51142
process.on('unhandledRejection', (reason: any) => {
52143
diag(`unhandledRejection: ${
53144
reason && reason.stack ? reason.stack : String(reason)
54-
} (lastTest="${lastSeenTest}")`);
145+
} (running="${currentTest}", lastFinished="${lastFinishedTest}")`);
55146
// Re-throw so existing common.ts handlers / mocha behavior is preserved.
56147
throw reason;
57148
});
58149

59150
process.on('uncaughtException', (err: any) => {
60151
diag(`uncaughtException: ${
61152
err && err.stack ? err.stack : String(err)
62-
} (lastTest="${lastSeenTest}")`);
153+
} (running="${currentTest}", lastFinished="${lastFinishedTest}")`);
63154
// Force fail-fast. Specs that don't import common.ts only have THIS handler,
64155
// and Node won't exit on its own once an uncaughtException listener is
65156
// registered. Without the explicit exit a fatal error would be swallowed.
@@ -69,18 +160,18 @@ process.on('uncaughtException', (err: any) => {
69160

70161
process.on('beforeExit', (code: number) => {
71162
diag(`beforeExit code=${code} exitCode=${process.exitCode} ` +
72-
`lastTest="${lastSeenTest}"`);
163+
`running="${currentTest}" lastFinished="${lastFinishedTest}"`);
73164
});
74165

75166
process.on('exit', (code: number) => {
76-
diag(`exit code=${code} lastTest="${lastSeenTest}"`);
167+
diag(`exit code=${code} running="${currentTest}" lastFinished="${lastFinishedTest}"`);
77168
});
78169

79170
for (const sig of ['SIGINT', 'SIGTERM', 'SIGHUP', 'SIGBREAK'] as const) {
80171
// SIGHUP / SIGBREAK don't exist on every platform; ignore registration errors.
81172
try {
82173
process.on(sig as any, () => {
83-
diag(`received ${sig} (lastTest="${lastSeenTest}")`);
174+
diag(`received ${sig} (running="${currentTest}", lastFinished="${lastFinishedTest}")`);
84175
// Let the default behavior (exit) happen.
85176
process.exit(128);
86177
});
@@ -89,12 +180,36 @@ for (const sig of ['SIGINT', 'SIGTERM', 'SIGHUP', 'SIGBREAK'] as const) {
89180
}
90181
}
91182

92-
// Mocha root hook — only registered if mocha picks up this file via --require.
93-
// We track the most recently-finished test so the death point is visible.
183+
// Mocha root hooks — only registered if mocha picks up this file via --require.
184+
// beforeEach sets the running pointer so a mid-test kill is attributable to a
185+
// specific test, not just the previous one that successfully finished.
186+
//
187+
// We also emit a synchronous diag line on every test start. The 1Hz heartbeat
188+
// misses tests that take less than a second, and the silent backend-test
189+
// kills land ~300 ms after a test boundary — exactly the gap where heartbeat
190+
// resolution fails us. A `start` line per test gives sub-millisecond
191+
// resolution on which test was on the rails when the process died.
94192
export const mochaHooks = {
193+
beforeEach(this: any) {
194+
if (this.currentTest) {
195+
currentTest = this.currentTest.fullTitle();
196+
diag(`test start: ${currentTest}`);
197+
// Drop a node-report at test-boundary granularity when the inter-report
198+
// gap is wide enough. Run 26399285213's rerun caught the kill on the
199+
// socketio.ts duplicate-author test, but the previous boundary write
200+
// had landed 128 ms earlier — inside our 250 ms throttle, so the
201+
// dying test's own beforeEach was suppressed. 100 ms is tighter than
202+
// the inter-test cadence of fast burst suites (~2-5 ms per test, so
203+
// ~20-50× throttled = max ~10 writes/sec) yet still captures
204+
// boundary writes for any test whose neighbour fired ≥100 ms ago,
205+
// including the socketio tests in the dying-test pattern.
206+
tryWriteReport('be', 100);
207+
}
208+
},
95209
afterEach(this: any) {
96210
if (this.currentTest) {
97-
lastSeenTest = this.currentTest.fullTitle();
211+
lastFinishedTest = this.currentTest.fullTitle();
212+
currentTest = '<no test running>';
98213
}
99214
},
100215
};

src/tests/backend/specs/api/importexportGetPost.ts

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -418,8 +418,6 @@ describe(__filename, function () {
418418
// that a buggy makeGoodExport() doesn't cause checks to accidentally pass.
419419
const records = makeGoodExport();
420420
await deleteTestPad();
421-
const importedPads = await importEtherpad(records)
422-
console.log(importedPads)
423421
await importEtherpad(records)
424422
.expect(200)
425423
.expect('Content-Type', /json/)

0 commit comments

Comments
 (0)