
🤖 perf: smooth text streaming (kill cascade re-renders, model-aware reveal) #3219

Merged

ammario merged 6 commits into main from streaming-r08f on May 2, 2026
Conversation

@ammar-agent (Collaborator) commented May 2, 2026

Summary

Streamed assistant text (and reasoning) was visibly jittery: periodic catch-up jumps every few seconds, a reveal rate stuck at ~72 chars/sec regardless of what the model emitted, and a sub-frame of re-render work across the entire chat list on every delta. This PR smooths the cadence with three ordered fixes plus a TPS-display fix discovered during review: leaf-subscribe the streaming-stats pill so it stops invalidating WorkspaceState, replace the smoothing engine's hard snap with a model-aware soft catch-up, compact streaming parts on append, and floor the TPS calculator's time span so a new stream's first deltas don't spike the displayed rate.

Background

The renderer has had a two-clock smoothing model (SmoothTextEngine + useSmoothStreamingText) for a while, but several regressions defeated it:

  1. WorkspaceState.streamingTokenCount / streamingTPS were computed inside the getWorkspaceState snapshot using Date.now(). Every coalesced delta produced a new snapshot reference, which cascaded WorkspaceShell → ChatPane → MessageRenderer through every row. useDeferredValue was bypassed for the entire stream by shouldBypassDeferredMessages, so reconciliation ran at the ingestion rate (see the sketch after this list).
  2. getAdaptiveRate(backlog) ignored the model's actual emission rate. With a fast model (~120 cps) and BASE_CHARS_PER_SEC=72, the visible cursor fell behind by ~5 chars per ingestion cycle until backlog crossed MAX_VISUAL_LAG_CHARS=120, at which point enforceMaxVisualLag snapped visible := full - 120 and zeroed the budget — that snap is exactly the visible "catch-up jump".
  3. requestIdleCallback({ timeout: 100 }) was used for streaming deltas. The smoothing engine should be the only pacing layer; idle batching just feeds (2).
  4. handleStreamDelta appended a fresh { type: "text" } part per chunk; mergeAdjacentParts re-merged on every render. For a 10k-char reply that's tens of thousands of merges per turn.
  5. calculateTPS divided by now - firstDelta.timestamp. With a single delta that span is typically a few milliseconds, so e.g. 50 tokens / 0.005s = 10000 t/s. Phase 1's microtask cadence exposed this (the prior idle-callback batching had masked it by sampling later), and Phase 2 wired TPS into the smoothing engine, amplifying its visibility.
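To make item (1) concrete, here is a minimal sketch of the problematic pattern. The types and field names are illustrative, not the real WorkspaceStore shapes:

```ts
// Hypothetical shape, not the actual WorkspaceStore code. The point is only
// that deriving time-based stats inside getSnapshot() forces a new object
// (and a new streamingTPS value, thanks to Date.now()) on every call, so
// useSyncExternalStore reports a change per coalesced delta and the whole
// subscriber tree re-renders.
interface WorkspaceInternal {
  messages: readonly string[];
  streamStartMs: number;
  tokensSoFar: number;
}

interface WorkspaceSnapshot {
  messages: readonly string[];
  streamingTokenCount: number;
  streamingTPS: number;
}

function getWorkspaceState(ws: WorkspaceInternal): WorkspaceSnapshot {
  const elapsedSec = Math.max((Date.now() - ws.streamStartMs) / 1000, 0.001);
  return {
    // Fresh object literal every call: the snapshot can never be cached by
    // reference, even when nothing the UI cares about has changed.
    messages: ws.messages,
    streamingTokenCount: ws.tokensSoFar,
    streamingTPS: ws.tokensSoFar / elapsedSec,
  };
}
```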

Implementation

Four phase commits (plus two review follow-ups described in the comments below), ordered so each phase is verifiable in isolation:

Phase 1 — leaf-subscribe streaming stats, microtask ingestion (775e9023c)

  • Removed streamingTokenCount / streamingTPS from WorkspaceState.
  • Added WorkspaceStreamingStats + streamingStatsStore (MapStore) + useWorkspaceStreamingStats(workspaceId) leaf hook (mirrors the existing useWorkspaceStatsSnapshot pattern at WorkspaceStore.ts:4127).
  • Replaced scheduleIdleStateBump with scheduleStreamingStateBump for streaming delta types (stream-delta, tool-call-delta, reasoning-delta). It coalesces on queueMicrotask instead of an idle callback. init-output and bash-output keep the idle path (terminal-style throughput).
  • Wired cancelPendingStreamingBump into stream-end / stream-abort / replay reset / removeWorkspace.
  • StreamingBarrier now reads via the leaf hook.
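
A rough sketch of the Phase 1 shape, for reviewers who want the mechanics without opening the diff. The exported names mirror the PR; the store internals and the coalescer wiring are assumptions, not the actual implementation:

```ts
import { useCallback, useSyncExternalStore } from "react";

// Assumed stats shape and store internals; only the exported names mirror the PR.
export interface WorkspaceStreamingStats {
  tokenCount: number;
  tps: number;
  charsPerSec: number;
}

class MapStore<T> {
  private values = new Map<string, T>();
  private listeners = new Map<string, Set<() => void>>();

  get(key: string): T | undefined {
    return this.values.get(key);
  }

  set(key: string, value: T): void {
    this.values.set(key, value);
    this.bump(key);
  }

  bump(key: string): void {
    this.listeners.get(key)?.forEach((notify) => notify());
  }

  subscribe(key: string, listener: () => void): () => void {
    const set = this.listeners.get(key) ?? new Set<() => void>();
    set.add(listener);
    this.listeners.set(key, set);
    return () => set.delete(listener);
  }
}

export const streamingStatsStore = new MapStore<WorkspaceStreamingStats>();

// Leaf hook: only components that actually render streaming stats subscribe
// to the per-workspace key, so a stats bump never touches WorkspaceState.
export function useWorkspaceStreamingStats(
  workspaceId: string
): WorkspaceStreamingStats | null {
  const subscribe = useCallback(
    (onChange: () => void) => streamingStatsStore.subscribe(workspaceId, onChange),
    [workspaceId]
  );
  return useSyncExternalStore(subscribe, () => streamingStatsStore.get(workspaceId) ?? null);
}

// Microtask coalescing for the streaming-state bump: many deltas arriving in
// one task collapse into a single notification, and terminal events cancel it.
const pendingBumps = new Map<string, () => void>();

export function scheduleStreamingStateBump(workspaceId: string, notify: () => void): void {
  if (pendingBumps.has(workspaceId)) return;
  pendingBumps.set(workspaceId, notify);
  queueMicrotask(() => {
    const queued = pendingBumps.get(workspaceId);
    if (queued === undefined) return; // cancelled before the microtask ran
    pendingBumps.delete(workspaceId);
    queued();
  });
}

export function cancelPendingStreamingBump(workspaceId: string): void {
  pendingBumps.delete(workspaceId);
}
```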

Phase 2 — model-aware smoothing engine, soft catch-up (85fb141da)

  • SmoothTextEngine.update() accepts an optional liveCharsPerSec. getAdaptiveRate(backlog, liveCps) combines a steady-state floor (max(BASE, liveCps)), a soft catch-up ramp that drains lag over SOFT_CATCHUP_DRAIN_MS once it exceeds SOFT_CATCHUP_LAG_CHARS=60, and the legacy backlog-pressure ramp (kept as upper bound).
  • Replaced the hard-snap discontinuity with the soft ramp. MAX_VISUAL_LAG_CHARS is now 1024 (was 120) — a defensive safety net for paused-tab pathological bursts that normal streams never hit.
  • Bumped MIN_FRAME_CHARS from 1 to 2 so reveals coalesce to ~30 Hz at the BASE rate (half the markdown re-parse cost; humans can't see the difference). Tail-end reveal still works because the gate is now min(MIN_FRAME_CHARS, backlog).
  • useSmoothStreamingText and TypewriterMarkdown thread liveCharsPerSec through; TypewriterMarkdown accepts a new workspaceId prop, forwarded from AssistantMessage and ReasoningMessage (via MessageRenderer).
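
A rough sketch of the rate combination and the reveal gate described above. The constants are the ones named in the PR; SOFT_CATCHUP_DRAIN_MS and the exact ramp shapes are assumptions, not the real SmoothTextEngine math:

```ts
// Constants named in the PR; SOFT_CATCHUP_DRAIN_MS and the ramp shapes below
// are assumptions.
const BASE_CHARS_PER_SEC = 72;
const SOFT_CATCHUP_LAG_CHARS = 60;
const SOFT_CATCHUP_DRAIN_MS = 1500; // assumed value
const MIN_FRAME_CHARS = 2;

function getAdaptiveRate(backlog: number, liveCps?: number): number {
  // Steady-state floor: never reveal slower than the model actually emits.
  const floor = Math.max(BASE_CHARS_PER_SEC, liveCps ?? 0);

  // Soft catch-up: drain any lag beyond the threshold over SOFT_CATCHUP_DRAIN_MS,
  // so the visible cursor converges without a discontinuous snap.
  const excessLag = Math.max(0, backlog - SOFT_CATCHUP_LAG_CHARS);
  const catchup = (excessLag * 1000) / SOFT_CATCHUP_DRAIN_MS;

  // Legacy backlog-pressure ramp, retained as an upper bound (assumed shape).
  const ceiling = Math.max(floor, BASE_CHARS_PER_SEC + backlog * 2);

  return Math.min(floor + catchup, ceiling);
}

// Reveal gate: coalesce reveals to at least MIN_FRAME_CHARS per frame, while
// the min() keeps the tail end of a stream from stalling below that threshold.
function charsToReveal(budget: number, backlog: number): number {
  const gate = Math.min(MIN_FRAME_CHARS, backlog);
  return budget >= gate ? Math.min(Math.floor(budget), backlog) : 0;
}
```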

Phase 3 — compact-on-append, clean prop surface (0a945ed7b)

  • StreamingMessageAggregator.handleStreamDelta / handleReasoningDelta append into the previous adjacent text/reasoning part in place. For a 10k-char reply this drops parts.length from thousands to one and mergeAdjacentParts cost from O(N) to O(1). Backend persistence (partial.json, chat.jsonl) is unaffected — those writers live backend-side; this aggregator's parts is pure display state.
  • TypewriterMarkdown: dropped the deltas: string[] shape (always passed as [content] literal — defeated React.memo) for content: string. Removed the manual React.memo and the inner useMemo for the streaming-context value (React Compiler handles both).
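
A minimal sketch of compact-on-append for the text path (handleReasoningDelta does the same for reasoning parts). The part shape is assumed; the real aggregator tracks more part types:

```ts
// Assumed display-part shape; the real aggregator tracks more part types.
type DisplayPart =
  | { type: "text"; text: string }
  | { type: "reasoning"; text: string }
  | { type: "tool-call"; name: string };

function appendTextDelta(parts: DisplayPart[], delta: string): void {
  const last = parts[parts.length - 1];
  if (last !== undefined && last.type === "text") {
    // Compact on append: extend the trailing text part in place instead of
    // pushing a new { type: "text" } part per chunk, so parts.length stays
    // constant during a stream and there is nothing left to merge per render.
    last.text += delta;
    return;
  }
  parts.push({ type: "text", text: delta });
}
```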

Phase 4 — TPS calculator floor + stream-error token cleanup (a476613be)

  • calculateTPS now floors the divisor at MIN_TPS_TIME_SPAN_MS = 1000. With one delta the rate becomes tokens / 1s instead of tokens / 0.005s. The reported TPS smoothly ramps up over the first second of a stream instead of spiking and "dropping abruptly". Slight under-statement during the settling window is the trade-off — strictly preferable to an order-of-magnitude over-statement.
  • The stream-error branch in applyWorkspaceChatEventToAggregator now calls clearTokenState, matching stream-end and stream-abort. Without it, the errored message's deltaHistory entry leaks into a follow-up stream's TPS calculation.
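
A minimal sketch of the floored calculator, assuming a (history, now) shape; it matches the behaviors the new tests assert (floor for tiny/zero spans, raw span above the floor, zero on clock skew):

```ts
// Assumed calculator shape; the divisor floor is the behavior Phase 4 adds.
const MIN_TPS_TIME_SPAN_MS = 1000;

interface TokenDelta {
  tokens: number;
  timestamp: number;
}

function calculateTPS(history: readonly TokenDelta[], now: number): number {
  if (history.length === 0) return 0;
  const rawSpanMs = now - history[0].timestamp;
  if (rawSpanMs < 0) return 0; // clock skew: report nothing rather than nonsense
  // Floor the divisor: a brand-new stream reports tokens-per-1s and ramps up
  // smoothly, instead of dividing by a few milliseconds and spiking.
  const spanMs = Math.max(rawSpanMs, MIN_TPS_TIME_SPAN_MS);
  const tokens = history.reduce((sum, d) => sum + d.tokens, 0);
  return (tokens * 1000) / spanMs;
}
```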

Validation

  • make typecheck
  • make lint
  • Targeted streaming surface: 1009+ tests pass / 0 fail across SmoothTextEngine, useSmoothStreamingText, StreamingMessageAggregator, applyWorkspaceChatEventToAggregator, StreamingTPSCalculator, TypewriterMarkdown, ReasoningMessage, StreamingBarrier{,View}, PinnedTodoList, WorkspaceStore, plus the broader src/browser/utils/messages/, src/browser/features/Messages/, src/browser/stores/, and src/browser/hooks/ suites.
  • New behavioral tests:
    • SmoothTextEngine.test.ts: rate tracks liveCharsPerSec; soft catch-up engaged for 60–1024 char lags without snap; hard snap still fires above the safety threshold.
    • StreamingTPSCalculator.test.ts: 1s floor applied for tiny / zero spans; raw span used once it exceeds the floor; negative spans (clock skew) return 0.
    • applyWorkspaceChatEventToAggregator.test.ts: stream-error calls clearTokenState.
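
For reference, a sketch of what the StreamingTPSCalculator assertions look like, assuming a vitest-style runner and the two-argument calculateTPS shape sketched under Phase 4; the import path is illustrative:

```ts
// Assumed runner and import path; calculateTPS here takes (history, now).
import { describe, expect, it } from "vitest";
import { calculateTPS } from "./StreamingTPSCalculator";

describe("calculateTPS floor", () => {
  it("floors tiny spans at 1s", () => {
    // 50 tokens observed 5 ms into the stream → 50 t/s, not 10000 t/s.
    expect(calculateTPS([{ tokens: 50, timestamp: 1_000 }], 1_005)).toBeCloseTo(50);
  });

  it("uses the raw span once it exceeds the floor", () => {
    // 100 tokens over 4 s → 25 t/s.
    const history = [
      { tokens: 60, timestamp: 0 },
      { tokens: 40, timestamp: 3_000 },
    ];
    expect(calculateTPS(history, 4_000)).toBeCloseTo(25);
  });

  it("returns 0 on negative spans (clock skew)", () => {
    expect(calculateTPS([{ tokens: 10, timestamp: 2_000 }], 1_000)).toBe(0);
  });
});
```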

Risks

Localized to the streaming display path; no protocol or persistence changes.

  • Re-render shape (Phase 1). Streaming deltas now bump WorkspaceState once per microtask drain instead of once per requestIdleCallback. Net effect under heavy load is less work because the snapshot stops invalidating per-delta TPS, but it's a behavioral shift — verified via the existing 106-test WorkspaceStore suite plus targeted StreamingBarrier tests.
  • Smoothing engine constants (Phase 2). MAX_VISUAL_LAG_CHARS jumped 120 → 1024 and MIN_FRAME_CHARS 1 → 2. Existing test "caps visual lag when incoming text jumps ahead" still passes against the new soft-ramp behavior, and the new "hard-snaps when lag exceeds the safety threshold" test confirms the safety net still functions.
  • Compact-on-append (Phase 3). Touches the in-memory parts array shape during streaming. The aggregator already had compaction at stream-end (compactMessageParts); we're just doing it eagerly. No on-disk format change. All StreamingMessageAggregator and applyWorkspaceChatEventToAggregator tests pass.
  • TPS floor (Phase 4). The reported rate during the first second of a stream now under-counts versus the previous (mathematically broken) value. Backend sessionTimingService also calls calculateTPS; same floor applies there but the backend's window is broader so the visible effect is smaller. No risk to persisted usage / cost calculations — those use usage.outputTokens / duration from the API, not the streaming TPS estimator.

Generated with mux • Model: anthropic:claude-opus-4-7 • Thinking: xhigh • Cost: $23.55

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a945ed7bc

Comment thread on src/browser/features/Messages/TypewriterMarkdown.tsx (outdated)
Streaming TPS pill could briefly show inflated values at the start of a
new stream then 'drop abruptly' to a real value, looking like cached
stale data. Two root causes:

1. calculateTPS divided by (now - firstDelta.timestamp). For a brand
   new stream's first delta that span is just a few ms — e.g.
   '50 tokens / 0.005s = 10000 t/s'. As more deltas accumulate the
   window broadens and TPS settles, hence the visible drop. Phase 1's
   microtask cadence exposed this where the prior idle-callback batching
   used to mask it. Floor the divisor to a 1s minimum window so the rate
   smoothly ramps up over the first second of a stream instead of
   spiking. Underestimation during the settling window is acceptable;
   order-of-magnitude overestimation isn't.

2. The stream-error event handler in applyWorkspaceChatEventToAggregator
   didn't call clearTokenState, leaving the errored message's
   deltaHistory entry to leak. Match stream-end / stream-abort and clear
   it so a follow-up stream starts with a clean slate.

Adds tests for both behaviors.

Codex P1 on PR #3219: every TypewriterMarkdown instance was subscribing
to useWorkspaceStreamingStats(workspaceId) regardless of streaming
state. Long transcripts of completed assistant messages then re-rendered
on every stream-delta of the new live message, undoing part of the
cascade-rerender fix.

Subscribe to the real workspace key only while the message is actively
streaming; completed messages subscribe to the empty-string sentinel
which is never bumped. Hook still runs unconditionally per rules of
hooks — only the key changes.
@ammar-agent (Collaborator, Author) commented:

@codex review

Addressed your P1: TypewriterMarkdown now subscribes to useWorkspaceStreamingStats with the real workspace key only while isStreaming is true. Completed historical messages subscribe to the stable empty-string sentinel (which is never bumped), so a long transcript of finished assistant messages no longer re-renders on every stream-delta of a new active stream.

Added regression tests in TypewriterMarkdown.test.tsx covering both branches (completed → empty key, streaming → real key).
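
For reviewers, a minimal sketch of the key-switching pattern (component and hook names are from the PR; the props, hook signature, and rendering are assumed or elided):

```ts
// Assumed hook signature; see the Phase 1 sketch for a possible implementation.
declare function useWorkspaceStreamingStats(
  workspaceId: string
): { charsPerSec: number } | null;

interface TypewriterMarkdownProps {
  content: string;
  isStreaming: boolean;
  workspaceId: string;
}

export function TypewriterMarkdown(props: TypewriterMarkdownProps) {
  // The hook always runs (rules of hooks); only the subscription key changes.
  // Completed messages subscribe to the empty-string sentinel, which is never
  // bumped, so long transcripts stop re-rendering on another stream's deltas.
  const stats = useWorkspaceStreamingStats(props.isStreaming ? props.workspaceId : "");
  const liveCharsPerSec = stats?.charsPerSec;
  // In the real component, props.content is revealed through the smoothing
  // engine, paced by liveCharsPerSec; the markdown rendering is elided here.
  return null;
}
```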

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff12fced30

Comment thread on src/browser/stores/WorkspaceStore.ts
Codex P2 on PR #3219: streamingStatsStore was bumped on stream-end and
stream-abort but not on stream-error. Subscribers (TypewriterMarkdown
via useWorkspaceStreamingStats) could keep returning the failed
stream's TPS/charsPerSec until the next delta arrived, leaking stale
rates into the next stream's early renders.

Mirror the stream-end / stream-abort terminal-cleanup pattern in the
stream-error path: cancel any pending coalesced bumps, then bump
streamingStatsStore so consumers re-read and the snapshot collapses to
null (getActiveStreamMessageId is already undefined post-error).

Adds a regression test that drives stream-start + stream-delta +
caught-up to populate the cache, then asserts both that subscribers
are notified and that the post-error snapshot is null.
@ammar-agent (Collaborator, Author) commented:

@codex review

Addressed your P2: stream-error now mirrors the stream-end/stream-abort terminal-cleanup pattern: it cancels pending coalesced bumps and explicitly bumps streamingStatsStore so subscribers re-read and the snapshot collapses to null (since getActiveStreamMessageId is already cleared by handleStreamError). Added a regression test in WorkspaceStore.test.ts that drives stream-start → stream-delta → caught-up to populate the cache, then asserts both that subscribers are notified and that the post-error snapshot is null.
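
A minimal sketch of the stream-error branch after this change; the declared helpers stand in for the real implementations named above:

```ts
// Assumed wiring; function and store names are from the PR, the declarations
// below stand in for the real implementations.
declare function cancelPendingStreamingBump(workspaceId: string): void;
declare function clearTokenState(workspaceId: string): void;
declare const streamingStatsStore: { bump(key: string): void };

// stream-error now mirrors the stream-end / stream-abort terminal cleanup.
function onStreamError(workspaceId: string): void {
  // Drop any coalesced bump still queued for the failed stream.
  cancelPendingStreamingBump(workspaceId);
  // Clear per-stream token accounting so the next stream's TPS starts clean.
  clearTokenState(workspaceId);
  // Notify subscribers so the stats snapshot collapses to null
  // (getActiveStreamMessageId is already undefined after the error).
  streamingStatsStore.bump(workspaceId);
}
```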

@chatgpt-codex-connector (Bot) commented:

Codex Review: Didn't find any major issues. Nice work!

ammario merged commit af4912e into main on May 2, 2026
24 checks passed
ammario deleted the streaming-r08f branch on May 2, 2026 at 03:32
ammar-agent added a commit that referenced this pull request May 2, 2026
Upstream regression: @smithy/util-retry@4.3.7 was published at
2026-05-02T04:09:26Z with workspace:^ refs that escaped the smithy
monorepo. Any lockfile-free bun install (used by both
scripts/check-bench-agent.sh and scripts/smoke-test.sh — which mimic
'bun x mux@latest' for end users) fails to resolve those refs.

Reproduction outside this repo:
  mkdir t && cd t
  echo '{"dependencies":{"@aws-sdk/credential-providers":"^3.940.0"}}' > package.json
  bun install --ignore-scripts
  # → 'Workspace dependency "@smithy/types" not found'

This breaks every PR opened/pushed after 04:09Z (PR #3219 was lucky to
merge ~30 min before). Add an npm-style overrides pin to <=4.3.6 (the
last known-good release, 2026-04-28) until smithy republishes.

Verified locally: 'bash scripts/check-bench-agent.sh' now passes, and
@smithy/util-retry resolves to 4.3.6 in a fresh lockfile-free install.