fix(api): harden stream history replay against unbounded cold-tier walks by igorls · Pull Request #178 · eosrio/hyperion-history-api

igorls · 2026-06-04T11:25:37Z

Problem

A WAX operator's cold-tier node (data_cold, 136 shards / 13.9 TB at 92% disk) suddenly went to ~82% CPU + 442 MB/s last week, with no cluster or code changes. Diagnosis (cluster green, relocating_shards: 0, _cat/recovery empty, cold node data static) converged on a client doing aggressive stream history replays — streamPastCommon scrolls <chain>-<type>-* (every index, incl. the oldest cold shards), and a deep start_from walks the entire history.

The replay path had three weaknesses that let a single misbehaving client melt the cold tier:

No effective size cap. stream_scroll_limit gated only when set && !== -1; both unset and -1 meant unlimited — one subscription could scroll all of history.
No concurrency limit. Reconnect storms / many simultaneous deep replays spawned unbounded parallel full-history scrolls.
Scroll-context leaks. Early returns (scroll-limit reject, ack timeout, NACK) returned without clearScroll, pinning old/cold segments (file handles + blocked merges) until the 120s keepalive expired.

Changes (all in `streamPastCommon`)

Concurrency cap — api.stream_max_concurrent_replays (default 4, per API process). At capacity, replays are rejected with a clear "server busy, retry" message (backpressure) instead of piling onto the cold tier.
Finite scroll cap by default — stream_scroll_limit now defaults to 50000 when unset. -1 still means unlimited but logs a loud [WARN] past 100k docs so it can't silently run away.
Guaranteed scroll cleanup — the scroll context is released in a finally on every exit path (success, reject, timeout, NACK, disconnect).

Plus api.stream_max_concurrent_replays added to the config interface, Zod schema, and config.ref.json; the sample stream_scroll_limit changed from -1 to 50000.

Operator note

Existing configs with stream_scroll_limit: -1 keep unlimited per-replay scrolls (now logged), but the concurrency cap applies regardless (defaults to 4), so deploying this immediately bounds parallel cold-tier walks. Operators should also set a finite stream_scroll_limit.

Testing

tsc --noEmit clean; bun test tests/unit → 121 pass. No change to live streaming or to the REST endpoints.

A stream subscription with a past start_from scrolls <chain>-<type>-* (every index, including the oldest cold-tier shards). Three hardening guards in streamPastCommon: - Concurrency cap: bound in-flight history replays per API process (api.stream_max_concurrent_replays, default 4) with backpressure, so a reconnect storm or many simultaneous deep replays can't spawn parallel full-history scrolls onto the cold tier. - Finite scroll cap by default: stream_scroll_limit now defaults to 50000 when unset (was effectively unlimited for both unset and -1). -1 still means unlimited but logs a loud warning past 100k docs. - Guaranteed scroll cleanup: the scroll context is now released in a finally block on every exit path (success, scroll-limit reject, ack timeout, NACK, disconnect). Previously those early returns leaked the scroll, pinning old/cold segments until the keepalive expired. Adds api.stream_max_concurrent_replays to the config schema + reference.

gemini-code-assist

Code Review

This pull request introduces hardening measures for streaming history replays to prevent overloading cold-tier shards. It implements concurrency limits on history replays per API process, caps the default scroll limit, and ensures that Elasticsearch scroll contexts are reliably cleared using a try...finally block. A critical issue was identified in the scroll loop where an empty hits array could trigger an infinite loop, and a code suggestion was provided to break the loop safely.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-04T11:27:10Z

+            if (rp) {
+
+                currentScrollId = rp._scroll_id;


If the Elasticsearch scroll returns an empty page of hits (rp.hits.hits.length === 0) before counter exactly matches getTotalValue(rp) (which can happen due to document updates, index changes, or if track_total_hits is capped), the loop will continue to request more scroll pages indefinitely. This results in an infinite loop that will consume 100% CPU and spam Elasticsearch with scroll requests. Adding a check to break the loop when no hits are returned prevents this critical failure mode.

Suggested change

if (rp) {

currentScrollId = rp._scroll_id;

if (rp) {

if (!rp.hits?.hits || rp.hits.hits.length === 0) {

hLog(`${counter} past ${dataKind}s streamed to ${socket.id} (${totalFiltered} filtered)`);

break;

}

currentScrollId = rp._scroll_id;

Copilot

Pull request overview

Hardens the streaming “history replay” path (streamPastCommon) to prevent unbounded cold-tier index scans from overwhelming Elasticsearch when clients request deep replays and/or reconnect rapidly.

Changes:

Adds a per-process concurrency cap for history replays (api.stream_max_concurrent_replays, default 4) with backpressure messaging.
Makes history replay size bounded by default (stream_scroll_limit defaults to 50k when unset; -1 remains unbounded but emits a warning past 100k).
Ensures Elasticsearch scroll contexts are always cleared via finally to prevent pinning old/cold segments.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`src/interfaces/hyperionConfig.ts`	Adds config typing and Zod schema support for `stream_max_concurrent_replays` and documents `stream_scroll_limit` default semantics.
`src/api/helpers/functions.ts`	Implements replay concurrency limiting, default scroll caps + warnings for unbounded replays, and guarantees scroll cleanup in `finally`.
`references/config.ref.json`	Updates reference defaults (`stream_scroll_limit: 50000`) and adds `stream_max_concurrent_replays: 4`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

                } else {
-                    filterCount++;
+                    hLog(`LOST CLIENT During ${dataKind.toUpperCase()} history replay!`);
+                    break;


+    const maxConcurrent = fastify.manager.config.api.stream_max_concurrent_replays ?? DEFAULT_MAX_CONCURRENT_REPLAYS;
+    if (activeHistoryReplays >= maxConcurrent) {


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c640b83fda

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-04T11:28:21Z

+        const esQuery = {
+            index: fastify.manager.chain + `-${dataKind}-*`,
+            scroll: `${MAX_SCROLL_TIME_SEC}s`,
+            size: fastify.manager.config.api.stream_scroll_batch || 500,
+            ...search_body


Track total hits before enforcing replay caps

When a history query matches more than Elasticsearch's default 10k total-hit tracking window, getTotalValue() sees only 10000 with relation gte. Because the new 50k cap is enforced only from that initial total and the loop also stops when getTotalValue(rp) === counter, a replay with, for example, 60k matching docs will neither be rejected at the configured 50k limit nor stream up to 50k; it silently ends after 10k, and stream_scroll_limit=-1 also never reaches the new warning threshold. Set track_total_hits here or enforce the cap based on scroll batches instead of the capped total.

Useful? React with 👍 / 👎.

- CRITICAL (gemini): break the scroll loop on an empty hits page. Relying solely on counter === total could loop forever scrolling empty pages (deletes/merges mid-scroll, or a capped total counter never reaches) — pegging CPU and spamming ES. - Track total hits to (cap + 1) on the replay query (codex): ES only tracks 10k by default, so a 60k-doc replay would slip a 50k cap and silently truncate at 10k. Now the cap is enforced accurately and replays stream to their true end. - Return a failure status on client disconnect (copilot): previously returned status:true, so the caller launched up to 3 follow-up 'fill' replays scrolling the cold tier for a dead socket. - Normalize stream_max_concurrent_replays against NaN/negative (copilot) so a bad config can't silently disable the cap.

igorls · 2026-06-04T11:32:31Z

Thanks — all four addressed in a620e73:

CRITICAL (gemini) — infinite empty-scroll loop: the loop now breaks on an empty hits page (!rp.hits?.hits || length === 0), the authoritative scroll terminator. Relying on counter === total alone could spin forever (deletes/merges mid-scroll, or a capped total counter never lands on) — pegging CPU and spamming ES, the exact failure this PR exists to prevent. Good catch.
(codex) — track_total_hits defeats the cap: ES tracked only 10k by default, so a 60k-doc replay slipped the 50k cap and silently truncated at 10k. The replay query now sets track_total_hits to cap + 1 (or warn + 1 for -1), so the cap is enforced accurately and in-cap replays stream to their true end.
(copilot) — disconnect returned success: on LOST CLIENT it now returns status: false, so the caller stops instead of launching its 3 follow-up "fill" replays against the cold tier for a dead socket.
(copilot) — config hardening: stream_max_concurrent_replays is normalized to a finite ≥1 value so a NaN/negative config can't silently remove the cap.

tsc clean; 121 unit tests pass.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a620e7319c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-04T11:35:00Z

+        const trackTotalHits = scrollLimit === -1
+            ? UNBOUNDED_REPLAY_WARN_THRESHOLD + 1
+            : scrollLimit + 1;


Avoid treating the unlimited warning threshold as completion

When api.stream_scroll_limit is explicitly -1, this search only tracks hits up to 100001; if the real replay is larger and stream_scroll_batch divides that threshold (for example batch 1, 11, or 100001), getTotalValue(rp) === counter becomes true at the tracked lower bound and the supposedly unlimited replay stops early. The fresh evidence is this new capped track_total_hits value for the -1 path; completion needs to distinguish an exact total from a gte lower bound or avoid a numeric cap for unlimited replays.

Useful? React with 👍 / 👎.

Copilot AI review requested due to automatic review settings June 4, 2026 11:25

Copilot started reviewing on behalf of igorls June 4, 2026 11:25 View session

gemini-code-assist Bot reviewed Jun 4, 2026

View reviewed changes

Copilot AI reviewed Jun 4, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 4, 2026

View reviewed changes

igorls merged commit 13651ee into dev Jun 4, 2026
2 checks passed

igorls deleted the harden/stream-replay branch June 4, 2026 11:34

chatgpt-codex-connector Bot reviewed Jun 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(api): harden stream history replay against unbounded cold-tier walks#178

fix(api): harden stream history replay against unbounded cold-tier walks#178
igorls merged 2 commits into
devfrom
harden/stream-replay

igorls commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Uh oh!

igorls commented Jun 4, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		const maxConcurrent = fastify.manager.config.api.stream_max_concurrent_replays ?? DEFAULT_MAX_CONCURRENT_REPLAYS;
		if (activeHistoryReplays >= maxConcurrent) {

Conversation

igorls commented Jun 4, 2026

Problem

Changes (all in streamPastCommon)

Operator note

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

igorls commented Jun 4, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Changes (all in `streamPastCommon`)