Commit 98dbba4
test(ci): heartbeat + running-test pointer to debug the silent backend-test ELIFECYCLE (#7838)
* test(ci): heartbeat + running-test pointer in backend test diagnostics
Backend tests silently die with code 255 mid-suite ~22% of the time on
develop (most often Windows-with-plugins, Node 24). Each kill lands
300±50 ms after the previous test's clean ✔ teardown line and produces
no failing-test marker, no error, no Mocha summary, and — despite the
unconditional handlers in `diagnostics.ts` — none of the JS-level death
events fire either. Recent example: run 26311025244 (`Windows with
Plugins (24)`); both attempts crashed at completely different "last
test" locations, so the dying test itself isn't to blame.
The existing diagnostics only set lastSeenTest in afterEach, so if the
kill lands during the NEXT test's setup or body — which is exactly the
~300ms gap we observe — the pointer reads as the previous (passing)
test. That hides whether we're between tests or inside one, and which
one.
Two changes:
1. Track currentTest in beforeEach as well as lastFinishedTest in
afterEach. Every diag line now carries both, so the death point is
bracketable regardless of which lifecycle phase the kill interrupts.
2. Add a 1Hz heartbeat that writeSyncs the running-test name plus
`process.memoryUsage()` (rss, heap) and the active-handle and
active-request counts. The interval is unref'd so it never holds the
event loop open by itself. Cost is roughly one extra log line per
second of mocha runtime (~60-120 lines per CI run).
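A minimal sketch of the heartbeat described in point 2, not the actual `diagnostics.ts` code. The names `TestPointers`, `formatHeartbeat`, and `startHeartbeat`, and the `[diag]` line layout, are assumptions made for illustration. The handle and request counts are left out for brevity.

```typescript
import {writeSync} from 'node:fs';

// Hypothetical shared state; the real diagnostics.ts may name these differently.
interface TestPointers {
  currentTest: string | null;       // set in beforeEach
  lastFinishedTest: string | null;  // set in afterEach
}

const pointers: TestPointers = {currentTest: null, lastFinishedTest: null};

// Build one heartbeat line from the pointers and a memory snapshot.
const formatHeartbeat = (
    p: TestPointers, mem: {rss: number; heapUsed: number}, tick: number): string => {
  const mb = (n: number) => (n / 1024 / 1024).toFixed(1);
  return `[diag] hb=${tick} running=${p.currentTest ?? '-'} ` +
      `last=${p.lastFinishedTest ?? '-'} rss=${mb(mem.rss)}MB heap=${mb(mem.heapUsed)}MB\n`;
};

// Start the heartbeat. writeSync(1, ...) writes to stdout synchronously, so the
// line is not left sitting in a stream buffer if the process is killed.
const startHeartbeat = (intervalMs = 1000): NodeJS.Timeout => {
  let tick = 0;
  const timer = setInterval(() => {
    try {
      writeSync(1, formatHeartbeat(pointers, process.memoryUsage(), ++tick));
    } catch {
      // Diagnostics must never fail the test run.
    }
  }, intervalMs);
  timer.unref(); // the interval alone must not keep the event loop open
  return timer;
};
```

Calling `startHeartbeat()` once from the mocha bootstrap would be enough in this sketch; because the timer is unref'd, no explicit shutdown is needed.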
When the next failure fires, the last heartbeat narrows the kill window
to ≤1s, the running pointer names the test on the rails at that moment,
and the handle/memory trace gives a sparkline that exposes sudden
spikes — a leaked socket, an unref'd timer, a runaway map — that
would otherwise be invisible at the runner-log level.
No behavior change on successful runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: heartbeat _getActiveHandles optional chain bug (Qodo #2)
Qodo correctly flagged `_getActiveHandles?.().length` as a latent
TypeError: `?.()` guards the call but the call's `undefined` return
on a missing method still hits `.length`, which throws. Since the
heartbeat fires on a setInterval inside the mocha bootstrap, a Node
build without the underscore-prefixed internals would take down the
whole backend test run.
Capture the array first, then read `.length` only when it actually
exists. -1 stays as the "API missing" sentinel.
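The "capture first, then read `.length`" shape might look like the sketch below. The helper name `countOrMissing` is hypothetical. `_getActiveHandles` and `_getActiveRequests` are undocumented Node internals, so the sketch treats both as possibly absent.

```typescript
// Count the entries returned by an optional zero-argument method.
// Returns -1 (the "API missing" sentinel) when the method does not exist
// or does not return an array.
const countOrMissing = (obj: unknown, method: string): number => {
  const fn = (obj as Record<string, unknown> | null | undefined)?.[method];
  if (typeof fn !== 'function') return -1;
  const result: unknown = fn.call(obj); // capture the array first
  return Array.isArray(result) ? result.length : -1; // then read .length
};

const activeHandles = countOrMissing(process, '_getActiveHandles');
const activeRequests = countOrMissing(process, '_getActiveRequests');
```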
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(ci): per-test start diag + drop stray console.log noise
Follow-up to the heartbeat PR after run 26397693748 confirmed the
diagnostic works (the kill landed at importexportGetPost.ts
'Import authorization checks > authn anonymous !exist -> fail',
~300 ms after the previous test's ✔). Two cleanups so the next
failure pinpoints faster and reads cleaner:
1. diagnostics.ts: emit a `test start: <name>` diag line in the
mocha beforeEach hook, after setting the currentTest pointer.
The 1Hz heartbeat misses tests that take less than a second,
and the silent kills land ~300 ms after a test boundary —
precisely the gap where heartbeat resolution fails. A start
line per test gives sub-millisecond resolution on which test
was on the rails when the process died.
2. specs/api/importexportGetPost.ts: drop a stray
`console.log(importedPads)` debug leftover (and the duplicate
`await importEtherpad(records)` only present to feed it) in
the `malformed .etherpad files are rejected` block. The leftover
dumped a ~600-line reflection of a supertest Response object
to the CI log on every successful run, drowning the surrounding
test output and making the silent-kill window much harder to
read.
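A rough sketch of the per-test start line from point 1, assuming hooks shaped like Mocha's root hooks. `makeHooks`, `HookContext`, and the exact line format are illustrative names, not taken from `diagnostics.ts`; a real root-hook plugin would export the returned hooks as `mochaHooks`.

```typescript
import {writeSync} from 'node:fs';

// Minimal structural view of the Mocha hook context; only what this sketch uses.
interface HookContext {
  currentTest?: {fullTitle(): string};
}

// The sink is injectable so the hooks can be exercised without touching stdout.
const makeHooks = (write: (line: string) => void = (l) => { writeSync(1, l); }) => {
  const state = {
    currentTest: null as string | null,
    lastFinishedTest: null as string | null,
  };
  return {
    state,
    beforeEach(this: HookContext) {
      // Set the pointer first, then emit the start line synchronously.
      state.currentTest = this.currentTest?.fullTitle() ?? null;
      const ms = Math.round(process.uptime() * 1000);
      write(`[diag +${ms}ms] test start: ${state.currentTest ?? '-'}\n`);
    },
    afterEach(this: HookContext) {
      state.lastFinishedTest = this.currentTest?.fullTitle() ?? null;
      state.currentTest = null;
    },
  };
};
```

Because the start line is written in `beforeEach` rather than on a timer, a test that starts and dies within the same heartbeat second still leaves its name in the log.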
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(ci): write node-report on every heartbeat tick
Run 26398054688 narrowed the kill to a specific test
(pad.ts > Gets text on a pad Id and doesn't have an excess newline)
but the test body is a trivial supertest GET — the kill bypasses
all JS handlers, so we can't capture stack state at death.
Two failures across two runs share the shape: an agent.{get,post}
+ common.generateJWTToken() call dies ~300-600 ms after test start,
with no JS-visible cause. The next step is V8 + native stack.
Hook into the existing 1Hz heartbeat to call
process.report.writeReport(path) whenever a report directory is set.
The Windows backend-tests workflow already wires up
`--report-directory=${{ github.workspace }}/node-report` via
NODE_OPTIONS and uploads that directory as an artifact on failure,
so the rolling snapshots ride for free on the existing upload step.
Each report (~50 KB) contains:
- V8 + native call stacks for all threads
- libuv active handles (open TCP, timers, file handles)
- JS heap statistics
- resourceUsage + system info
- shared-object list
On the next reproduction the latest report before ELIFECYCLE will
sit ~0-1 s before the kill — enough to see whether the V8 stack
is inside jose's WebCrypto sign path, inside supertest's TCP
roundtrip, or somewhere unexpected entirely.
NODE_REPORT_DIR is also honored as an explicit override for local
repro / non-workflow runs.
Cost: ~6 files (~300 KB) per Windows backend-test failure, plus
~50 ms event-loop pause per heartbeat. No-op when neither env var
is set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: writeReport with bare filename, not mixed-slash absolute path
Run 26398830249 exposed the path-separator bug in the previous commit:
every heartbeat tick on the Windows runner logged
    Failed to open Node.js report file:
      D:\a\etherpad\etherpad/node-report/hb-NNNN-...json
      directory: D:\a\etherpad\etherpad/node-report (errno: 22)
— EINVAL. The workflow sets --report-directory with forward-slash
separators on Windows, then this code concatenated another `/` plus
the filename, producing a path Node's report writer rejects.
writeReport(fileName) takes a BARE filename and resolves it against
the configured report directory using the platform-correct separator
internally. Switch to that. For local repro overrides via
NODE_REPORT_DIR, push the path into process.report.directory (the
documented config knob) instead of joining it into the call site.
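The fixed call shape could be sketched as follows. The helper names and the file-name scheme (`<prefix>-<seq>-<pid>.json`) are assumptions for illustration only; `process.report.directory` and `process.report.writeReport(fileName)` are the documented Node APIs the commit refers to.

```typescript
// Apply the NODE_REPORT_DIR override (if any) and say whether a report
// directory is configured at all.
const configureReportDir = (env: NodeJS.ProcessEnv = process.env): boolean => {
  const report = process.report;
  if (!report) return false;
  const override = env.NODE_REPORT_DIR;
  if (override) report.directory = override; // config knob, not a path join
  return Boolean(report.directory);
};

// Hypothetical naming scheme; the real file names may differ.
const reportFileName = (prefix: string, seq: number): string =>
    `${prefix}-${String(seq).padStart(4, '0')}-${process.pid}.json`;

// Pass a BARE file name; Node resolves it against the report directory
// with the platform's own separator.
const writeSnapshot = (prefix: string, seq: number): string | null => {
  const report = process.report;
  if (!report || !report.directory) return null; // no-op when nothing is configured
  try {
    return report.writeReport(reportFileName(prefix, seq));
  } catch {
    return null; // diagnostics must never fail the test run
  }
};
```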
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(ci): write node-report on test boundaries, throttled to 4Hz
Run 26398985832 proved the heartbeat-only report cadence isn't tight
enough: the last report before the kill was hb-0013 at +16201ms,
~1.5 s before ELIFECYCLE at +17701ms — during which ~30 tests fired,
including the dying one (`authn anonymous !exist -> fail`). The
captured V8 stack is just our heartbeat code, not the dying test.
Move the writeReport call to a shared tryWriteReport() helper and
invoke it from BOTH the heartbeat AND mocha's beforeEach hook,
throttled to one report per 250 ms. That gives ≤250 ms resolution
on the kill window — close enough that the latest report captures
state from inside the dying test rather than from the test ~30
slots earlier. The heartbeat always writes (so we don't lose the
no-test-running ticks during setup); beforeEach only writes when
the throttle window has elapsed.
Cost ceiling: ~4 reports/sec × ~12 s test phase ≈ 48 reports
(~2.5 MB) per failing run. Each writeReport adds ~50 ms of
event-loop pause — at 4Hz that's 20% of wall time spent in
diagnostics, which is acceptable for a temporary debug-only
bootstrap.
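One way to express the throttle is a small gate with an injectable clock. This is a sketch under assumptions, not the actual `tryWriteReport()` body: `makeThrottle` and its `force` flag are hypothetical names.

```typescript
// Returns a gate that answers "may a report be written now?".
// force=true models the heartbeat, which always writes; the default models
// beforeEach, which writes only once the throttle window has elapsed.
const makeThrottle = (minGapMs: number, now: () => number = () => performance.now()) => {
  let lastWrite = -Infinity;
  return (force = false): boolean => {
    const t = now();
    if (!force && t - lastWrite < minGapMs) return false;
    lastWrite = t; // every accepted write restarts the window
    return true;
  };
};

// Example wiring with the 250 ms window described above.
const allowReport = makeThrottle(250);
const onHeartbeat = (): boolean => allowReport(true);
const onBeforeEach = (): boolean => allowReport();
```

With this shape the cost ceiling follows directly from the window: at most one boundary write per 250 ms gives about 4 reports per second, which matches the estimate above.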
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(ci): drop beforeEach report throttle from 250ms to 100ms
Run 26399285213's rerun captured a sixth death point on the new 4Hz
cadence (`socketio.ts > Duplicate-author handling > cookie identity:
same-author second socket kicks the first`, kill at +45953ms, 271ms
after test start). The throttle suppressed the dying test's own
beforeEach: previous boundary write landed 128 ms earlier and the
next 31 ms after that, both inside the 250 ms window. Last captured
report (be-0100) is from the previous test.
100 ms is still well above the inter-test cadence in fast burst
suites (tests fire 2-5 ms apart, so 20-50 of them get throttled to a
single write, ceiling ~10 writes/sec). But it's tight enough that
any death-window neighbour ≥100 ms after the previous report — the
shape we keep observing — gets its own boundary snapshot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d9dabe3 commit 98dbba4
2 files changed
Lines changed: 127 additions & 14 deletions