fix(api): harden stream history replay against unbounded cold-tier walks#178
Conversation
A stream subscription with a past start_from scrolls <chain>-<type>-* (every index, including the oldest cold-tier shards). Three hardening guards in streamPastCommon: - Concurrency cap: bound in-flight history replays per API process (api.stream_max_concurrent_replays, default 4) with backpressure, so a reconnect storm or many simultaneous deep replays can't spawn parallel full-history scrolls onto the cold tier. - Finite scroll cap by default: stream_scroll_limit now defaults to 50000 when unset (was effectively unlimited for both unset and -1). -1 still means unlimited but logs a loud warning past 100k docs. - Guaranteed scroll cleanup: the scroll context is now released in a finally block on every exit path (success, scroll-limit reject, ack timeout, NACK, disconnect). Previously those early returns leaked the scroll, pinning old/cold segments until the keepalive expired. Adds api.stream_max_concurrent_replays to the config schema + reference.
There was a problem hiding this comment.
Code Review
This pull request introduces hardening measures for streaming history replays to prevent overloading cold-tier shards. It implements concurrency limits on history replays per API process, caps the default scroll limit, and ensures that Elasticsearch scroll contexts are reliably cleared using a try...finally block. A critical issue was identified in the scroll loop where an empty hits array could trigger an infinite loop, and a code suggestion was provided to break the loop safely.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if (rp) { | ||
|
|
||
| currentScrollId = rp._scroll_id; |
There was a problem hiding this comment.
If the Elasticsearch scroll returns an empty page of hits (rp.hits.hits.length === 0) before counter exactly matches getTotalValue(rp) (which can happen due to document updates, index changes, or if track_total_hits is capped), the loop will continue to request more scroll pages indefinitely. This results in an infinite loop that will consume 100% CPU and spam Elasticsearch with scroll requests. Adding a check to break the loop when no hits are returned prevents this critical failure mode.
| if (rp) { | |
| currentScrollId = rp._scroll_id; | |
| if (rp) { | |
| if (!rp.hits?.hits || rp.hits.hits.length === 0) { | |
| hLog(`${counter} past ${dataKind}s streamed to ${socket.id} (${totalFiltered} filtered)`); | |
| break; | |
| } | |
| currentScrollId = rp._scroll_id; |
There was a problem hiding this comment.
Pull request overview
Hardens the streaming “history replay” path (streamPastCommon) to prevent unbounded cold-tier index scans from overwhelming Elasticsearch when clients request deep replays and/or reconnect rapidly.
Changes:
- Adds a per-process concurrency cap for history replays (
api.stream_max_concurrent_replays, default 4) with backpressure messaging. - Makes history replay size bounded by default (
stream_scroll_limitdefaults to 50k when unset;-1remains unbounded but emits a warning past 100k). - Ensures Elasticsearch scroll contexts are always cleared via
finallyto prevent pinning old/cold segments.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/interfaces/hyperionConfig.ts |
Adds config typing and Zod schema support for stream_max_concurrent_replays and documents stream_scroll_limit default semantics. |
src/api/helpers/functions.ts |
Implements replay concurrency limiting, default scroll caps + warnings for unbounded replays, and guarantees scroll cleanup in finally. |
references/config.ref.json |
Updates reference defaults (stream_scroll_limit: 50000) and adds stream_max_concurrent_replays: 4. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } else { | ||
| filterCount++; | ||
| hLog(`LOST CLIENT During ${dataKind.toUpperCase()} history replay!`); | ||
| break; |
| const maxConcurrent = fastify.manager.config.api.stream_max_concurrent_replays ?? DEFAULT_MAX_CONCURRENT_REPLAYS; | ||
| if (activeHistoryReplays >= maxConcurrent) { |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c640b83fda
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const esQuery = { | ||
| index: fastify.manager.chain + `-${dataKind}-*`, | ||
| scroll: `${MAX_SCROLL_TIME_SEC}s`, | ||
| size: fastify.manager.config.api.stream_scroll_batch || 500, | ||
| ...search_body |
There was a problem hiding this comment.
Track total hits before enforcing replay caps
When a history query matches more than Elasticsearch's default 10k total-hit tracking window, getTotalValue() sees only 10000 with relation gte. Because the new 50k cap is enforced only from that initial total and the loop also stops when getTotalValue(rp) === counter, a replay with, for example, 60k matching docs will neither be rejected at the configured 50k limit nor stream up to 50k; it silently ends after 10k, and stream_scroll_limit=-1 also never reaches the new warning threshold. Set track_total_hits here or enforce the cap based on scroll batches instead of the capped total.
Useful? React with 👍 / 👎.
- CRITICAL (gemini): break the scroll loop on an empty hits page. Relying solely on counter === total could loop forever scrolling empty pages (deletes/merges mid-scroll, or a capped total counter never reaches) — pegging CPU and spamming ES. - Track total hits to (cap + 1) on the replay query (codex): ES only tracks 10k by default, so a 60k-doc replay would slip a 50k cap and silently truncate at 10k. Now the cap is enforced accurately and replays stream to their true end. - Return a failure status on client disconnect (copilot): previously returned status:true, so the caller launched up to 3 follow-up 'fill' replays scrolling the cold tier for a dead socket. - Normalize stream_max_concurrent_replays against NaN/negative (copilot) so a bad config can't silently disable the cap.
|
Thanks — all four addressed in
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a620e7319c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const trackTotalHits = scrollLimit === -1 | ||
| ? UNBOUNDED_REPLAY_WARN_THRESHOLD + 1 | ||
| : scrollLimit + 1; |
There was a problem hiding this comment.
Avoid treating the unlimited warning threshold as completion
When api.stream_scroll_limit is explicitly -1, this search only tracks hits up to 100001; if the real replay is larger and stream_scroll_batch divides that threshold (for example batch 1, 11, or 100001), getTotalValue(rp) === counter becomes true at the tracked lower bound and the supposedly unlimited replay stops early. The fresh evidence is this new capped track_total_hits value for the -1 path; completion needs to distinguish an exact total from a gte lower bound or avoid a numeric cap for unlimited replays.
Useful? React with 👍 / 👎.
Problem
A WAX operator's cold-tier node (data_cold, 136 shards / 13.9 TB at 92% disk) suddenly went to ~82% CPU + 442 MB/s last week, with no cluster or code changes. Diagnosis (cluster green,
relocating_shards: 0,_cat/recoveryempty, cold node data static) converged on a client doing aggressive stream history replays —streamPastCommonscrolls<chain>-<type>-*(every index, incl. the oldest cold shards), and a deepstart_fromwalks the entire history.The replay path had three weaknesses that let a single misbehaving client melt the cold tier:
stream_scroll_limitgated only whenset && !== -1; both unset and-1meant unlimited — one subscription could scroll all of history.clearScroll, pinning old/cold segments (file handles + blocked merges) until the 120s keepalive expired.Changes (all in
streamPastCommon)api.stream_max_concurrent_replays(default 4, per API process). At capacity, replays are rejected with a clear "server busy, retry" message (backpressure) instead of piling onto the cold tier.stream_scroll_limitnow defaults to 50000 when unset.-1still means unlimited but logs a loud[WARN]past 100k docs so it can't silently run away.finallyon every exit path (success, reject, timeout, NACK, disconnect).Plus
api.stream_max_concurrent_replaysadded to the config interface, Zod schema, andconfig.ref.json; the samplestream_scroll_limitchanged from-1to50000.Operator note
Existing configs with
stream_scroll_limit: -1keep unlimited per-replay scrolls (now logged), but the concurrency cap applies regardless (defaults to 4), so deploying this immediately bounds parallel cold-tier walks. Operators should also set a finitestream_scroll_limit.Testing
tsc --noEmitclean;bun test tests/unit→ 121 pass. No change to live streaming or to the REST endpoints.