Skip to content

test(ci): heartbeat + running-test pointer to debug the silent backend-test ELIFECYCLE#7838

Merged
JohnMcLear merged 7 commits into
developfrom
test-ci-diagnostics-heartbeat-currenttest
May 25, 2026
Merged

test(ci): heartbeat + running-test pointer to debug the silent backend-test ELIFECYCLE#7838
JohnMcLear merged 7 commits into
developfrom
test-ci-diagnostics-heartbeat-currenttest

Conversation

@JohnMcLear
Copy link
Copy Markdown
Member

Summary

  • Track the currently-running test (beforeEach), not just the last-finished one (afterEach), so a kill mid-test or in setup attributes correctly.
  • Add a 1Hz heartbeat with running-test name, RSS / heap, and active-handle / request counts. unref()'d so it never holds the event loop open.

Why

Backend tests silently die with exit code 255 mid-suite ~22% of the time on develop — most often Windows with Plugins (24), occasionally Linux too. Each kill lands ~300 ms after the previous test's clean ✔ teardown line and emits no failing-test marker, no error, no Mocha summary. The handlers in diagnostics.ts (PR #7663) and common.ts don't fire either: only [diag +0ms] diagnostics loaded lands before the silent ELIFECYCLE.

Recent reproduction: run 26311025244. Both attempts crashed at completely different "last test" locations (messages.ts > identity changeset is accepted in attempt 1; clientvar_rev_consistency.ts > CLIENT_VARS stays consistent under concurrent edits during handshake (delay race) in attempt 2), so the dying test itself isn't to blame.

The current lastSeenTest pointer is only updated in afterEach, so if the kill lands during the NEXT test's setup or body — which is exactly the ~300 ms gap we observe — the pointer points at the previous passing test, not the one being killed. The new running pointer + heartbeat narrow the kill window to ≤1 s and expose handle/memory behavior leading up to the kill.

What this changes

Pure diagnostic. No behavior change on successful runs.

  • mochaHooks.beforeEach now sets currentTest.
  • mochaHooks.afterEach continues to update lastFinishedTest and clears currentTest.
  • Every diag line carries both pointers.
  • New 1Hz heartbeat via writeSync(2, ...) + process.memoryUsage() + _getActiveHandles() / _getActiveRequests().

Cost: ~60-120 extra log lines per CI run.

Locally verified

Ran mocha against a 3.5 s probe spec:

[diag +0ms] diagnostics loaded
[diag +1002ms] hb running="heartbeat probe runs for 3.5s" lastFinished="<no test finished yet>" rss=109M heap=12M handles=2 requests=0
[diag +2002ms] hb running="heartbeat probe runs for 3.5s" lastFinished="<no test finished yet>" rss=109M heap=12M handles=2 requests=0
[diag +3003ms] hb running="heartbeat probe runs for 3.5s" lastFinished="<no test finished yet>" rss=109M heap=12M handles=2 requests=0
[diag +3565ms] beforeExit code=0 ... lastFinished="heartbeat probe runs for 3.5s"
[diag +3565ms] exit code=0 ... lastFinished="heartbeat probe runs for 3.5s"

Test plan

  • Existing backend-test matrix passes (Linux ± plugins, Windows without plugins).
  • Windows-with-plugins run produces visible heartbeat lines.
  • On the next reproduction of the silent flake, the log contains a final heartbeat naming the running test + handle count immediately before ELIFECYCLE.

🤖 Generated with Claude Code

Backend tests silently die with code 255 mid-suite ~22% of the time on
develop (most often Windows-with-plugins, Node 24). Each kill lands
300±50 ms after the previous test's clean ✔ teardown line and produces
no failing-test marker, no error, no Mocha summary, and — despite the
unconditional handlers in `diagnostics.ts` — none of the JS-level death
events fire either. Recent example: run 26311025244 (`Windows with
Plugins (24)`); both attempts crashed at completely different "last
test" locations, so the dying test itself isn't to blame.

The existing diagnostics only set lastSeenTest in afterEach, so if the
kill lands during the NEXT test's setup or body — which is exactly the
~300ms gap we observe — the pointer reads as the previous (passing)
test. That hides whether we're between tests or inside one, and which
one.

Two changes:

1. Track currentTest in beforeEach as well as lastFinishedTest in
   afterEach. Every diag line now carries both, so the death point is
   bracketable regardless of which lifecycle phase the kill interrupts.

2. Add a 1Hz heartbeat that writeSyncs the running-test name plus
   `process.memoryUsage()` (rss, heap) and the active-handle and
   active-request counts. The interval is unref'd so it never holds the
   event loop open by itself. Cost is roughly one extra log line per
   second of mocha runtime (~60-120 lines per CI run).

When the next failure fires, the last heartbeat narrows the kill window
to ≤1s, the running pointer names the test on the rails at that moment,
and the handle/memory trace gives a sparkline that exposes sudden
spikes — a leaked socket, an unref'd timer, a runaway map — that
would otherwise be invisible at the runner-log level.

No behavior change on successful runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

Review Summary by Qodo

Add heartbeat and running-test tracking to backend test diagnostics

🧪 Tests ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Track currently-running test in beforeEach, not just last-finished in afterEach
• Add 1Hz heartbeat logging memory usage and active handles/requests
• Narrow silent backend-test kill window from ~300ms to ≤1s
• Expose handle/memory leaks invisible at runner-log level
Diagram
flowchart LR
  A["Test Lifecycle"] -->|beforeEach| B["Set currentTest"]
  B -->|test runs| C["1Hz Heartbeat"]
  C -->|logs| D["Memory + Handles + Requests"]
  A -->|afterEach| E["Set lastFinishedTest"]
  E -->|clear| F["currentTest"]
  G["Process Death"] -->|kill signal| H["Last Heartbeat"]
  H -->|narrows window| I["≤1s bracket"]

Loading

File Changes

1. src/tests/backend/diagnostics.ts ✨ Enhancement +51/-12

Add heartbeat and dual test-pointer tracking

• Split lastSeenTest into currentTest (set in beforeEach) and lastFinishedTest (set in
 afterEach)
• Added 1Hz heartbeat via setInterval that logs running test name, RSS/heap memory usage, and
 active handle/request counts
• Updated all diagnostic log lines to include both running and lastFinished pointers instead of
 single lastTest
• Heartbeat interval is unref()'d to prevent holding event loop open
• Enhanced comments explaining the silent flake diagnosis and new tracking strategy

src/tests/backend/diagnostics.ts


Grey Divider

Qodo Logo

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

qodo-free-for-open-source-projects Bot commented May 25, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (1) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Heartbeat logging lacks feature flag 📘 Rule violation ⚙ Maintainability
Description
The new 1Hz heartbeat logging runs unconditionally whenever diagnostics.ts is loaded, adding
repeated stderr output and potential overhead with no enable/disable mechanism. This introduces new
behavior without a feature flag and is not disabled by default.
Code

src/tests/backend/diagnostics.ts[R68-81]

Evidence
PR Compliance ID 5 requires new functionality to be behind a feature flag and disabled by default.
The code introduces a heartbeat via setInterval that logs every second, and it is started
unconditionally without any flag/guard.

src/tests/backend/diagnostics.ts[68-81]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new heartbeat feature (`setInterval` that logs memory/handles every second) is always enabled when `src/tests/backend/diagnostics.ts` is required. The compliance checklist requires new features to be guarded behind a feature flag and disabled by default.
## Issue Context
The heartbeat currently starts unconditionally and writes periodic diagnostic lines to stderr. Add an explicit opt-in flag (for example an environment variable) so default behavior remains unchanged unless enabled.
## Fix Focus Areas
- src/tests/backend/diagnostics.ts[68-81]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Heartbeat optional-chain TypeError ✓ Resolved 🐞 Bug ☼ Reliability
Description
The heartbeat computes handle/request counts via (process as any)._getActiveHandles?.().length and
_getActiveRequests?.().length, but ?.() only guards the call — if the method is missing it
returns undefined and the subsequent .length access throws a TypeError. Because diagnostics.ts
is loaded via mocha --require in the main backend test script, this would take down the whole test
run on Node builds where those private APIs are absent/changed.
Code

src/tests/backend/diagnostics.ts[R74-75]

Evidence
The heartbeat interval reads _getActiveHandles/_getActiveRequests with ?.() but immediately
dereferences .length on the result, which will throw if the function is undefined; the backend
test script always --requires this file, so such a throw would affect all test runs.

src/tests/backend/diagnostics.ts[68-81]
src/package.json[150-153]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The heartbeat uses optional chaining only on the private API call (`?.()`), but not on the returned value before accessing `.length`, so the fallback `?? -1` is never reached if the API is missing.
## Issue Context
This file is always loaded for backend tests via `--require ./tests/backend/diagnostics.ts`, so an exception inside the heartbeat interval can crash the test process.
## Fix
Change both expressions to optionally chain the returned value as well (or otherwise guard the length access).
Example:

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment on lines +68 to +81
// Heartbeat. unref()'d so it never holds the event loop open by itself —
// it only fires if mocha is otherwise alive. The interval cadence (1Hz) is
// the trade-off between log noise (~60-120 extra lines per run) and how
// tightly we can bracket the kill timestamp.
const heartbeat = setInterval(() => {
const mem = process.memoryUsage();
const handles = (process as any)._getActiveHandles?.().length ?? -1;
const requests = (process as any)._getActiveRequests?.().length ?? -1;
diag(`hb running="${currentTest}" lastFinished="${lastFinishedTest}" ` +
`rss=${Math.round(mem.rss / 1024 / 1024)}M ` +
`heap=${Math.round(mem.heapUsed / 1024 / 1024)}M ` +
`handles=${handles} requests=${requests}`);
}, 1000);
heartbeat.unref();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Heartbeat logging lacks feature flag 📘 Rule violation ⚙ Maintainability

The new 1Hz heartbeat logging runs unconditionally whenever diagnostics.ts is loaded, adding
repeated stderr output and potential overhead with no enable/disable mechanism. This introduces new
behavior without a feature flag and is not disabled by default.
Agent Prompt
## Issue description
A new heartbeat feature (`setInterval` that logs memory/handles every second) is always enabled when `src/tests/backend/diagnostics.ts` is required. The compliance checklist requires new features to be guarded behind a feature flag and disabled by default.

## Issue Context
The heartbeat currently starts unconditionally and writes periodic diagnostic lines to stderr. Add an explicit opt-in flag (for example an environment variable) so default behavior remains unchanged unless enabled.

## Fix Focus Areas
- src/tests/backend/diagnostics.ts[68-81]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concern #2 (the optional-chain TypeError) is fixed in 9b79a76 — capture the array first, read .length only when it exists, keep -1 as the "API missing" sentinel. Good catch.

Concern #1 (feature flag): respectfully disagreeing. diagnostics.ts is a temporary debug-only bootstrap (top-of-file comment: "Drop this file once the flake's root cause is identified and fixed.") wired in via mocha --require in the backend test script. The whole file is already opt-in at the test-runner layer — adding an internal flag would just multiply the existing toggle, and defaulting it off would silence the diagnostic in CI which is the one environment where the silent flake reliably reproduces. The repo's pattern for diagnostic instrumentation (see PR #7663, which added the unhandled-rejection logger with no flag) is to land it unguarded and remove the file once the kill is rooted. Keeping that pattern here.

Qodo correctly flagged `_getActiveHandles?.().length` as a latent
TypeError: `?.()` guards the call but the call's `undefined` return
on a missing method still hits `.length`, which throws. Since the
heartbeat fires on a setInterval inside the mocha bootstrap, a Node
build without the underscore-prefixed internals would take down the
whole backend test run.

Capture the array first, then read `.length` only when it actually
exists. -1 stays as the "API missing" sentinel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the heartbeat PR after run 26397693748 confirmed the
diagnostic works (the kill landed at importexportGetPost.ts
'Import authorization checks > authn anonymous !exist -> fail',
~300 ms after the previous test's ✔). Two cleanups so the next
failure pinpoints faster and reads cleaner:

1. diagnostics.ts: emit a `test start: <name>` diag line in the
   mocha beforeEach hook, after setting the currentTest pointer.
   The 1Hz heartbeat misses tests that take less than a second,
   and the silent kills land ~300 ms after a test boundary —
   precisely the gap where heartbeat resolution fails. A start
   line per test gives sub-millisecond resolution on which test
   was on the rails when the process died.

2. specs/api/importexportGetPost.ts: drop a stray
   `console.log(importedPads)` debug leftover (and the duplicate
   `await importEtherpad(records)` only present to feed it) in
   the `malformed .etherpad files are rejected` block. The leftover
   dumped a ~600-line reflection of a supertest Response object
   to the CI log on every successful run, drowning the surrounding
   test output and making the silent-kill window much harder to
   read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 26398054688 narrowed the kill to a specific test
(pad.ts > Gets text on a pad Id and doesn't have an excess newline)
but the test body is a trivial supertest GET — the kill bypasses
all JS handlers, so we can't capture stack state at death.
Two failures across two runs share the shape: an agent.{get,post}
+ common.generateJWTToken() call dies ~300-600 ms after test start,
with no JS-visible cause. The next step is V8 + native stack.

Hook into the existing 1Hz heartbeat to call
process.report.writeReport(path) whenever a report directory is set.
The Windows backend-tests workflow already wires up
`--report-directory=${{ github.workspace }}/node-report` via
NODE_OPTIONS and uploads that directory as an artifact on failure,
so the rolling snapshots ride for free on the existing upload step.

Each report (~50 KB) contains:
  - V8 + native call stacks for all threads
  - libuv active handles (open TCP, timers, file handles)
  - JS heap statistics
  - resourceUsage + system info
  - shared-object list

On the next reproduction the latest report before ELIFECYCLE will
sit ~0-1 s before the kill — enough to see whether the V8 stack
is inside jose's WebCrypto sign path, inside supertest's TCP
roundtrip, or somewhere unexpected entirely.

NODE_REPORT_DIR is also honored as an explicit override for local
repro / non-workflow runs.

Cost: ~6 files (~300 KB) per Windows backend-test failure, plus
~50 ms event-loop pause per heartbeat. No-op when neither env var
is set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 26398830249 exposed the path-separator bug in the previous commit:
every heartbeat tick on the Windows runner logged

  Failed to open Node.js report file:
  D:\a\etherpad\etherpad/node-report/hb-NNNN-...json
  directory: D:\a\etherpad\etherpad/node-report (errno: 22)

— EINVAL. The workflow sets --report-directory with forward-slash
separators on Windows, then this code concatenated another `/` plus
the filename, producing a path Node's report writer rejects.

writeReport(fileName) takes a BARE filename and resolves it against
the configured report directory using the platform-correct separator
internally. Switch to that. For local repro overrides via
NODE_REPORT_DIR, push the path into process.report.directory (the
documented config knob) instead of joining it into the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 26398985832 proved the heartbeat-only report cadence isn't tight
enough: the last report before the kill was hb-0013 at +16201ms,
~1.5 s before ELIFECYCLE at +17701ms — during which ~30 tests fired,
including the dying one (`authn anonymous !exist -> fail`). The
captured V8 stack is just our heartbeat code, not the dying test.

Move the writeReport call to a shared tryWriteReport() helper and
invoke it from BOTH the heartbeat AND mocha's beforeEach hook,
throttled to one report per 250 ms. That gives ≤250 ms resolution
on the kill window — close enough that the latest report captures
state from inside the dying test rather than from the test ~30
slots earlier. The heartbeat always writes (so we don't lose the
no-test-running ticks during setup); beforeEach only writes when
the throttle window has elapsed.

Cost ceiling: ~4 reports/sec × ~12 s test phase ≈ 48 reports
(~2.5 MB) per failing run. Each writeReport adds ~50 ms of
event-loop pause — at 4Hz that's 20% of wall time spent in
diagnostics, which is acceptable for a temporary debug-only
bootstrap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 26399285213's rerun captured a sixth death point on the new 4Hz
cadence (`socketio.ts > Duplicate-author handling > cookie identity:
same-author second socket kicks the first`, kill at +45953ms, 271ms
after test start). The throttle suppressed the dying test's own
beforeEach: previous boundary write landed 128 ms earlier and the
next 31 ms after that, both inside the 250 ms window. Last captured
report (be-0100) is from the previous test.

100 ms is still well above the inter-test cadence in fast burst
suites (tests fire 2-5 ms apart, so 20-50 of them get throttled to a
single write, ceiling ~10 writes/sec). But it's tight enough that
any death-window neighbour ≥100 ms after the previous report — the
shape we keep observing — gets its own boundary snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear JohnMcLear merged commit 98dbba4 into develop May 25, 2026
40 checks passed
@JohnMcLear JohnMcLear deleted the test-ci-diagnostics-heartbeat-currenttest branch May 25, 2026 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant