Skip to content

fix(api): harden stream history replay against unbounded cold-tier walks#178

Merged
igorls merged 2 commits into
devfrom
harden/stream-replay
Jun 4, 2026
Merged

fix(api): harden stream history replay against unbounded cold-tier walks#178
igorls merged 2 commits into
devfrom
harden/stream-replay

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented Jun 4, 2026

Problem

A WAX operator's cold-tier node (data_cold, 136 shards / 13.9 TB at 92% disk) suddenly went to ~82% CPU + 442 MB/s last week, with no cluster or code changes. Diagnosis (cluster green, relocating_shards: 0, _cat/recovery empty, cold node data static) converged on a client doing aggressive stream history replaysstreamPastCommon scrolls <chain>-<type>-* (every index, incl. the oldest cold shards), and a deep start_from walks the entire history.

The replay path had three weaknesses that let a single misbehaving client melt the cold tier:

  1. No effective size cap. stream_scroll_limit gated only when set && !== -1; both unset and -1 meant unlimited — one subscription could scroll all of history.
  2. No concurrency limit. Reconnect storms / many simultaneous deep replays spawned unbounded parallel full-history scrolls.
  3. Scroll-context leaks. Early returns (scroll-limit reject, ack timeout, NACK) returned without clearScroll, pinning old/cold segments (file handles + blocked merges) until the 120s keepalive expired.

Changes (all in streamPastCommon)

  • Concurrency capapi.stream_max_concurrent_replays (default 4, per API process). At capacity, replays are rejected with a clear "server busy, retry" message (backpressure) instead of piling onto the cold tier.
  • Finite scroll cap by defaultstream_scroll_limit now defaults to 50000 when unset. -1 still means unlimited but logs a loud [WARN] past 100k docs so it can't silently run away.
  • Guaranteed scroll cleanup — the scroll context is released in a finally on every exit path (success, reject, timeout, NACK, disconnect).

Plus api.stream_max_concurrent_replays added to the config interface, Zod schema, and config.ref.json; the sample stream_scroll_limit changed from -1 to 50000.

Operator note

Existing configs with stream_scroll_limit: -1 keep unlimited per-replay scrolls (now logged), but the concurrency cap applies regardless (defaults to 4), so deploying this immediately bounds parallel cold-tier walks. Operators should also set a finite stream_scroll_limit.

Testing

tsc --noEmit clean; bun test tests/unit → 121 pass. No change to live streaming or to the REST endpoints.

A stream subscription with a past start_from scrolls <chain>-<type>-* (every index,
including the oldest cold-tier shards). Three hardening guards in streamPastCommon:

- Concurrency cap: bound in-flight history replays per API process
  (api.stream_max_concurrent_replays, default 4) with backpressure, so a reconnect storm or
  many simultaneous deep replays can't spawn parallel full-history scrolls onto the cold tier.
- Finite scroll cap by default: stream_scroll_limit now defaults to 50000 when unset (was
  effectively unlimited for both unset and -1). -1 still means unlimited but logs a loud
  warning past 100k docs.
- Guaranteed scroll cleanup: the scroll context is now released in a finally block on every
  exit path (success, scroll-limit reject, ack timeout, NACK, disconnect). Previously those
  early returns leaked the scroll, pinning old/cold segments until the keepalive expired.

Adds api.stream_max_concurrent_replays to the config schema + reference.
Copilot AI review requested due to automatic review settings June 4, 2026 11:25
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces hardening measures for streaming history replays to prevent overloading cold-tier shards. It implements concurrency limits on history replays per API process, caps the default scroll limit, and ensures that Elasticsearch scroll contexts are reliably cleared using a try...finally block. A critical issue was identified in the scroll loop where an empty hits array could trigger an infinite loop, and a code suggestion was provided to break the loop safely.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +249 to +251
if (rp) {

currentScrollId = rp._scroll_id;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

If the Elasticsearch scroll returns an empty page of hits (rp.hits.hits.length === 0) before counter exactly matches getTotalValue(rp) (which can happen due to document updates, index changes, or if track_total_hits is capped), the loop will continue to request more scroll pages indefinitely. This results in an infinite loop that will consume 100% CPU and spam Elasticsearch with scroll requests. Adding a check to break the loop when no hits are returned prevents this critical failure mode.

Suggested change
if (rp) {
currentScrollId = rp._scroll_id;
if (rp) {
if (!rp.hits?.hits || rp.hits.hits.length === 0) {
hLog(`${counter} past ${dataKind}s streamed to ${socket.id} (${totalFiltered} filtered)`);
break;
}
currentScrollId = rp._scroll_id;

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Hardens the streaming “history replay” path (streamPastCommon) to prevent unbounded cold-tier index scans from overwhelming Elasticsearch when clients request deep replays and/or reconnect rapidly.

Changes:

  • Adds a per-process concurrency cap for history replays (api.stream_max_concurrent_replays, default 4) with backpressure messaging.
  • Makes history replay size bounded by default (stream_scroll_limit defaults to 50k when unset; -1 remains unbounded but emits a warning past 100k).
  • Ensures Elasticsearch scroll contexts are always cleared via finally to prevent pinning old/cold segments.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/interfaces/hyperionConfig.ts Adds config typing and Zod schema support for stream_max_concurrent_replays and documents stream_scroll_limit default semantics.
src/api/helpers/functions.ts Implements replay concurrency limiting, default scroll caps + warnings for unbounded replays, and guarantees scroll cleanup in finally.
references/config.ref.json Updates reference defaults (stream_scroll_limit: 50000) and adds stream_max_concurrent_replays: 4.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/api/helpers/functions.ts Outdated
Comment on lines +316 to +318
} else {
filterCount++;
hLog(`LOST CLIENT During ${dataKind.toUpperCase()} history replay!`);
break;
Comment thread src/api/helpers/functions.ts Outdated
Comment on lines +171 to +172
const maxConcurrent = fastify.manager.config.api.stream_max_concurrent_replays ?? DEFAULT_MAX_CONCURRENT_REPLAYS;
if (activeHistoryReplays >= maxConcurrent) {
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c640b83fda

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/api/helpers/functions.ts Outdated
Comment on lines +196 to +200
const esQuery = {
index: fastify.manager.chain + `-${dataKind}-*`,
scroll: `${MAX_SCROLL_TIME_SEC}s`,
size: fastify.manager.config.api.stream_scroll_batch || 500,
...search_body
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Track total hits before enforcing replay caps

When a history query matches more than Elasticsearch's default 10k total-hit tracking window, getTotalValue() sees only 10000 with relation gte. Because the new 50k cap is enforced only from that initial total and the loop also stops when getTotalValue(rp) === counter, a replay with, for example, 60k matching docs will neither be rejected at the configured 50k limit nor stream up to 50k; it silently ends after 10k, and stream_scroll_limit=-1 also never reaches the new warning threshold. Set track_total_hits here or enforce the cap based on scroll batches instead of the capped total.

Useful? React with 👍 / 👎.

- CRITICAL (gemini): break the scroll loop on an empty hits page. Relying solely on
  counter === total could loop forever scrolling empty pages (deletes/merges mid-scroll, or a
  capped total counter never reaches) — pegging CPU and spamming ES.
- Track total hits to (cap + 1) on the replay query (codex): ES only tracks 10k by default, so
  a 60k-doc replay would slip a 50k cap and silently truncate at 10k. Now the cap is enforced
  accurately and replays stream to their true end.
- Return a failure status on client disconnect (copilot): previously returned status:true, so
  the caller launched up to 3 follow-up 'fill' replays scrolling the cold tier for a dead socket.
- Normalize stream_max_concurrent_replays against NaN/negative (copilot) so a bad config can't
  silently disable the cap.
@igorls
Copy link
Copy Markdown
Member Author

igorls commented Jun 4, 2026

Thanks — all four addressed in a620e73:

  • CRITICAL (gemini) — infinite empty-scroll loop: the loop now breaks on an empty hits page (!rp.hits?.hits || length === 0), the authoritative scroll terminator. Relying on counter === total alone could spin forever (deletes/merges mid-scroll, or a capped total counter never lands on) — pegging CPU and spamming ES, the exact failure this PR exists to prevent. Good catch.
  • (codex) — track_total_hits defeats the cap: ES tracked only 10k by default, so a 60k-doc replay slipped the 50k cap and silently truncated at 10k. The replay query now sets track_total_hits to cap + 1 (or warn + 1 for -1), so the cap is enforced accurately and in-cap replays stream to their true end.
  • (copilot) — disconnect returned success: on LOST CLIENT it now returns status: false, so the caller stops instead of launching its 3 follow-up "fill" replays against the cold tier for a dead socket.
  • (copilot) — config hardening: stream_max_concurrent_replays is normalized to a finite ≥1 value so a NaN/negative config can't silently remove the cap.

tsc clean; 121 unit tests pass.

@igorls igorls merged commit 13651ee into dev Jun 4, 2026
2 checks passed
@igorls igorls deleted the harden/stream-replay branch June 4, 2026 11:34
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a620e7319c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +209 to +211
const trackTotalHits = scrollLimit === -1
? UNBOUNDED_REPLAY_WARN_THRESHOLD + 1
: scrollLimit + 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid treating the unlimited warning threshold as completion

When api.stream_scroll_limit is explicitly -1, this search only tracks hits up to 100001; if the real replay is larger and stream_scroll_batch divides that threshold (for example batch 1, 11, or 100001), getTotalValue(rp) === counter becomes true at the tracked lower bound and the supposedly unlimited replay stops early. The fresh evidence is this new capped track_total_hits value for the -1 path; completion needs to distinguish an exact total from a gte lower bound or avoid a numeric cap for unlimited replays.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants