fix(decopilot): keep chat streams alive and recoverable across disconnects #3316
Merged
viktormarinho merged 2 commits into main on May 8, 2026
Adds 15s SSE keepalive comments to chat stream responses so reverse-proxy read timeouts (nginx ingress, Istio, ALB) don't kill long deep-research runs mid-flight.
Fixes claimOrphanedRun excluding same-pod ownership, which pinned threads in_progress forever on single-pod self-hosted deploys when a run left the in-memory registry while run_owner_pod still pointed at the live pod.
Bumps MAX_RUN_AGE_MS 30→90min so deep research isn't reaped while the provider job keeps burning credits, and clears the chat-side error banner on silent resume / new send so transient blips don't leave a sticky "network error" toast.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
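The same-pod orphan fix described in this commit can be illustrated with a small predicate comparison. This is only a sketch under stated assumptions: the `ThreadRow` shape, the status values other than `in_progress`, and the helper names `isClaimableBefore`, `isClaimableAfter`, and `shouldResumeOrphan` are hypothetical, and the real `claimOrphanedRun` most likely expresses this as a database query rather than in-memory checks.

```typescript
// Hypothetical shapes for illustration; not the PR's actual code.
type ThreadStatus = "in_progress" | "completed" | "failed";

interface ThreadRow {
  id: string;
  status: ThreadStatus;
  runOwnerPod: string | null; // mirrors the run_owner_pod column
}

// Old behavior as described in the PR: a row owned by the current pod
// was never claimable, even if no run was actually alive in memory.
function isClaimableBefore(row: ThreadRow, currentPod: string): boolean {
  return row.status === "in_progress" && row.runOwnerPod !== currentPod;
}

// New behavior: pod ownership no longer blocks the claim.
function isClaimableAfter(row: ThreadRow): boolean {
  return row.status === "in_progress";
}

// Caller-side guard: the PR notes that callers already check the
// in-memory registry, so a run that is still live is never re-claimed.
function shouldResumeOrphan(
  row: ThreadRow,
  isRunning: (threadId: string) => boolean,
): boolean {
  return !isRunning(row.id) && isClaimableAfter(row);
}
```

On a single-pod deploy, a thread left `in_progress` with `runOwnerPod` equal to the only pod fails the old predicate forever, while the new predicate plus the registry check lets it be resumed once the run is no longer in memory.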
pedrofrxncx approved these changes on May 8, 2026
What is this contribution about?
Fixes the frequent "network error" banner during chat streaming and the related "thread stuck in_progress forever" symptom on self-hosted deploys, especially during deep research. Three root causes were addressed:
1. `JsonToSseTransformStream` emits no SSE keepalive bytes during silent gaps (long tool calls, deep-research polling), so reverse proxies (nginx ingress, Istio, ALB, all with 15–60s read timeouts) cut the connection. Added a defensive `wrapWithSseKeepalive` that injects `: keepalive\n\n` comment lines every 15s, only on event boundaries, with full cleanup on close/error/cancel.
2. `claimOrphanedRun` excluded same-pod ownership, so on single-pod self-hosted deploys a run that left the in-memory registry while `run_owner_pod` still pointed at the live pod was permanently unrecoverable. Removed the exclusion (callers already verify `runRegistry.isRunning`); same fix applied to `listOrphanedRuns` for K8s StatefulSet rolling-restart recovery.
3. `MAX_RUN_AGE_MS` of 30 minutes prematurely reaped legitimate deep-research runs while the provider job kept burning credits; bumped to 90 minutes.
`chatError` is now cleared when silent resume succeeds and at the start of a new send, so transient blips don't leave a sticky "network error" toast.
How to Test
- `bun run dev`, start a long deep-research chat and let it run. Confirm the response continues to flow (heartbeats are invisible to the parser; verify via DevTools → Network → response stream payload includes `: keepalive` lines).
- … `/decopilot/attach/:threadId` instead of leaving the thread `in_progress`.
- … `POD_NAME` set, manually mark a thread `status=in_progress, run_owner_pod=<current pod>` while the registry is empty (e.g. between requests), then revisit the thread; orphan resume should now claim it.
- `bun test apps/mesh/src`: 1506 passing.
Migration Notes
None. No schema or config changes.
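To make the keepalive fix from the first root cause more concrete, here is a minimal sketch of a wrapper that injects `: keepalive\n\n` comments on a timer, only while the stream sits on an event boundary, and stops the timer on close, error, and cancel. It is not the PR's actual `wrapWithSseKeepalive`: the names `wrapWithSseKeepaliveSketch` and `trailingNewlines` are hypothetical, and the sketch assumes events are terminated by a bare `\n\n` (it does not handle `\r\n` line endings).

```typescript
// Hypothetical sketch; the real wrapWithSseKeepalive in the PR may differ.
const KEEPALIVE_INTERVAL_MS = 15_000;
const KEEPALIVE_TEXT = ": keepalive\n\n";
const NEWLINE = 0x0a;

/**
 * Counts trailing "\n" bytes (capped at 2) after appending `chunk` to a
 * stream that previously ended with `prev` trailing newlines.
 * A count of 2 means the stream sits on an SSE event boundary.
 */
function trailingNewlines(prev: number, chunk: Uint8Array): number {
  let count = prev;
  for (let i = 0; i < chunk.length; i++) {
    count = chunk[i] === NEWLINE ? Math.min(count + 1, 2) : 0;
  }
  return count;
}

function wrapWithSseKeepaliveSketch(
  source: ReadableStream<Uint8Array>,
  intervalMs: number = KEEPALIVE_INTERVAL_MS,
): ReadableStream<Uint8Array> {
  const reader = source.getReader();
  const encoder = new TextEncoder();
  let newlines = 2; // nothing sent yet, so a comment cannot split an event
  let timer: ReturnType<typeof setInterval> | undefined;

  const stopTimer = (): void => {
    if (timer !== undefined) {
      clearInterval(timer);
      timer = undefined;
    }
  };

  return new ReadableStream<Uint8Array>({
    start(controller) {
      timer = setInterval(() => {
        // Skip this tick if the last chunk ended in the middle of an event.
        if (newlines === 2) controller.enqueue(encoder.encode(KEEPALIVE_TEXT));
      }, intervalMs);
    },
    async pull(controller) {
      try {
        const { done, value } = await reader.read();
        if (done) {
          stopTimer();
          controller.close();
          return;
        }
        newlines = trailingNewlines(newlines, value);
        controller.enqueue(value);
      } catch (err) {
        stopTimer();
        controller.error(err);
      }
    },
    async cancel(reason) {
      stopTimer();
      await reader.cancel(reason);
    },
  });
}
```

A handler could return `new Response(wrapWithSseKeepaliveSketch(body))` with a `text/event-stream` content type. The sketch emits on a fixed interval; resetting the timer whenever real data flows would be an equally reasonable design.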
Review Checklist
Summary by cubic
Keeps chat SSE streams alive and recoverable across proxy disconnects, fixing the “network error” banner and threads stuck in_progress on self‑hosted. Adds 15s SSE keepalives, enables same‑pod orphan recovery, and clears stale error toasts.
- … `: keepalive` comments every 15s on SSE responses (only at event boundaries) to prevent reverse‑proxy read timeouts.
Written for commit 6721ccf. Summary will update on new commits.
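The keepalive comments are safe to inject because, in the Server-Sent Events format, any line starting with `:` is a comment that parsers ignore. The sketch below is a deliberately simplified parser (the name `parseSseData` is hypothetical, it assumes `\n` line endings, and it looks only at `data:` fields) that shows heartbeats producing no events on the client side.

```typescript
// Simplified SSE parser for illustration only; real clients handle more
// fields (event, id, retry) and other line endings.
function parseSseData(stream: string): string[] {
  const events: string[] = [];
  for (const block of stream.split("\n\n")) {
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith(":")) continue; // comment line, e.g. ": keepalive"
      if (line.startsWith("data:")) {
        const value = line.slice("data:".length);
        // A single leading space after the colon is not part of the value.
        dataLines.push(value.startsWith(" ") ? value.slice(1) : value);
      }
    }
    // Blocks without any data lines (such as a lone comment) yield no event.
    if (dataLines.length > 0) events.push(dataLines.join("\n"));
  }
  return events;
}
```

Feeding it a payload where a `: keepalive` block sits between two `data:` events yields only the two data payloads, which is why the heartbeats show up in the raw response stream but never reach the chat UI.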