
fix(decopilot): keep chat streams alive and recoverable across disconnects#3316

Merged
viktormarinho merged 2 commits into main from viktormarinho/network-error-debug
May 8, 2026

Conversation

viktormarinho (Contributor) commented May 8, 2026

What is this contribution about?

Fixes the frequent "network error" banner during chat streaming and the related "thread stuck in_progress forever" symptom on self-hosted deploys, especially during deep research. Three root causes were addressed, plus a client-side fix:

  • The AI SDK's JsonToSseTransformStream emits no SSE keepalive bytes during silent gaps (long tool calls, deep-research polling), so reverse proxies (nginx ingress, Istio, ALB — all 15–60s read timeouts) cut the connection. Added a defensive wrapWithSseKeepalive that injects : keepalive\n\n comment lines every 15s, only on event boundaries, with full cleanup on close/error/cancel.
  • claimOrphanedRun excluded same-pod ownership, so on single-pod self-hosted deploys a run that left the in-memory registry while run_owner_pod still pointed at the live pod was permanently unrecoverable. Removed the exclusion (callers already verify runRegistry.isRunning); same fix applied to listOrphanedRuns for K8s StatefulSet rolling-restart recovery.
  • MAX_RUN_AGE_MS of 30 minutes prematurely reaped legitimate deep-research runs while the provider job kept burning credits — bumped to 90 minutes.
  • Plus client-side: chatError is now cleared when silent resume succeeds and at the start of a new send, so transient blips don't leave a sticky "network error" toast.

How to Test

  1. With bun run dev, start a long deep-research chat and let it run. Confirm the response continues to flow (heartbeats are invisible to the parser; verify in DevTools → Network that the response stream payload includes : keepalive lines).
  2. To simulate a proxy disconnect: kill the dev server mid-stream, restart, revisit the thread — the orphan resume path should attach successfully via /decopilot/attach/:threadId instead of leaving the thread in_progress.
  3. To verify same-pod recovery: with POD_NAME set, manually mark a thread status=in_progress, run_owner_pod=<current pod> while the registry is empty (e.g. between requests), revisit the thread — orphan resume should now claim it.
  4. bun test apps/mesh/src — 1506 passing.
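
The DevTools check in step 1 can also be done from a terminal. The snippet below is a sketch: the printf payload stands in for a captured stream — in practice you would pipe `curl -N <your stream URL>` into the same grep.

```shell
# Stand-in for a captured SSE stream; in practice pipe `curl -N <stream-url>`.
# A healthy stream shows ": keepalive" comment lines between data events.
printf 'data: {"delta":"hi"}\n\n: keepalive\n\ndata: {"delta":"!"}\n\n' \
  | grep -c '^: keepalive'
```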

Migration Notes

None. No schema or config changes.

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working (1506 mesh tests pass; new unit tests cover the keepalive boundary, cancel propagation, leak prevention; new test covers same-pod claim)
  • Documentation is updated (if needed) — N/A
  • No breaking changes

Summary by cubic

Keeps chat SSE streams alive and recoverable across proxy disconnects, fixing the “network error” banner and threads stuck in_progress on self‑hosted. Adds 15s SSE keepalives, enables same‑pod orphan recovery, and clears stale error toasts.

  • Bug Fixes
    • Injects : keepalive comments every 15s on SSE responses (only at event boundaries) to prevent reverse‑proxy read timeouts.
    • Allows same‑pod claims in orphan recovery and includes same‑pod runs in startup listing to handle StatefulSet restarts and single‑pod deploys.
    • Clears chat error on successful silent resume and when starting a new message to avoid sticky “network error” toasts.
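
The same-pod claim change can be illustrated with a small sketch. `claimOrphanedRun`, `run_owner_pod`, and the registry check are named in the PR; the function shape and field names here are illustrative assumptions, not the real code.

```typescript
// Sketch of the claim logic after the fix: a run is claimable whenever the
// in-memory registry no longer tracks it, even if run_owner_pod is this pod.
interface Run {
  threadId: string;
  status: string;
  runOwnerPod: string | null;
}

function claimOrphanedRun(
  run: Run,
  currentPod: string,
  registryIsRunning: (threadId: string) => boolean,
): boolean {
  if (run.status !== "in_progress") return false;
  // Callers already verify the registry; double-check defensively.
  if (registryIsRunning(run.threadId)) return false;
  // Before the fix this also bailed out when run.runOwnerPod === currentPod,
  // which made single-pod deploys permanently unrecoverable.
  run.runOwnerPod = currentPod;
  return true;
}
```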

Written for commit 6721ccf. Summary will update on new commits.

…nects

Adds 15s SSE keepalive comments to chat stream responses so reverse-proxy
read timeouts (nginx ingress, Istio, ALB) don't kill long deep-research
runs mid-flight. Fixes claimOrphanedRun excluding same-pod ownership,
which pinned threads in_progress forever on single-pod self-hosted deploys
when a run left the in-memory registry while run_owner_pod still pointed
at the live pod. Bumps MAX_RUN_AGE_MS 30→90min so deep research isn't
reaped while the provider job keeps burning credits, and clears the
chat-side error banner on silent resume / new send so transient blips
don't leave a sticky "network error" toast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot commented May 8, 2026

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction | Action
👍 | Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

github-actions Bot commented May 8, 2026

Release Options

Suggested: Patch (2.310.12) — based on fix: prefix

React with an emoji to override the release type:

Reaction | Type | Next Version
👍 | Prerelease | 2.310.12-alpha.1
🎉 | Patch | 2.310.12
❤️ | Minor | 2.311.0
🚀 | Major | 3.0.0

Current version: 2.310.11

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).

viktormarinho merged commit 3c16183 into main May 8, 2026
15 of 16 checks passed
viktormarinho deleted the viktormarinho/network-error-debug branch May 8, 2026 19:05
