
fix(decopilot): keep chat streams alive and recoverable across disconnects#3316

Merged
viktormarinho merged 2 commits into main from viktormarinho/network-error-debug
May 8, 2026

Conversation

viktormarinho (Contributor) commented May 8, 2026

What is this contribution about?

Fixes the frequent "network error" banner during chat streaming and the related "thread stuck in_progress forever" symptom on self-hosted deploys, especially during deep research. Three root causes were addressed, plus a client-side fix:

  • The AI SDK's JsonToSseTransformStream emits no SSE keepalive bytes during silent gaps (long tool calls, deep-research polling), so reverse proxies (nginx ingress, Istio, ALB — all 15–60s read timeouts) cut the connection. Added a defensive wrapWithSseKeepalive that injects : keepalive\n\n comment lines every 15s, only on event boundaries, with full cleanup on close/error/cancel.
  • claimOrphanedRun excluded same-pod ownership, so on single-pod self-hosted deploys a run that left the in-memory registry while run_owner_pod still pointed at the live pod was permanently unrecoverable. Removed the exclusion (callers already verify runRegistry.isRunning); same fix applied to listOrphanedRuns for K8s StatefulSet rolling-restart recovery.
  • MAX_RUN_AGE_MS of 30 minutes prematurely reaped legitimate deep-research runs while the provider job kept burning credits — bumped to 90 minutes.
  • Plus client-side: chatError is now cleared when silent resume succeeds and at the start of a new send, so transient blips don't leave a sticky "network error" toast.

How to Test

  1. With bun run dev, start a long deep-research chat and let it run. Confirm the response continues to flow (heartbeats are invisible to the parser; verify in DevTools → Network that the response stream payload includes : keepalive lines).
  2. To simulate a proxy disconnect: kill the dev server mid-stream, restart, revisit the thread — the orphan resume path should attach successfully via /decopilot/attach/:threadId instead of leaving the thread in_progress.
  3. To verify same-pod recovery: with POD_NAME set, manually mark a thread status=in_progress, run_owner_pod=<current pod> while the registry is empty (e.g. between requests), revisit the thread — orphan resume should now claim it.
  4. bun test apps/mesh/src — 1506 passing.
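
The DevTools check in step 1 can also be done from a terminal. The snippet below is a sketch: the printf payload stands in for a captured stream — in practice you would pipe `curl -N <your stream URL>` into the same grep.

```shell
# Stand-in for a captured SSE stream; in practice pipe `curl -N <stream-url>`.
# A healthy stream shows ": keepalive" comment lines between data events.
printf 'data: {"delta":"hi"}\n\n: keepalive\n\ndata: {"delta":"!"}\n\n' \
  | grep -c '^: keepalive'
```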

Migration Notes

None. No schema or config changes.

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working (1506 mesh tests pass; new unit tests cover the keepalive boundary, cancel propagation, leak prevention; new test covers same-pod claim)
  • Documentation is updated (if needed) — N/A
  • No breaking changes

Summary by cubic

Keeps chat SSE streams alive and recoverable across proxy disconnects, fixing the “network error” banner and threads stuck in_progress on self‑hosted. Adds 15s SSE keepalives, enables same‑pod orphan recovery, and clears stale error toasts.

  • Bug Fixes
    • Injects : keepalive comments every 15s on SSE responses (only at event boundaries) to prevent reverse‑proxy read timeouts.
    • Allows same‑pod claims in orphan recovery and includes same‑pod runs in startup listing to handle StatefulSet restarts and single‑pod deploys.
    • Clears chat error on successful silent resume and when starting a new message to avoid sticky “network error” toasts.
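
The same-pod claim change can be illustrated with a small sketch. `claimOrphanedRun`, `run_owner_pod`, and the registry check are named in the PR; the function shape and field names here are illustrative assumptions, not the real code.

```typescript
// Sketch of the claim logic after the fix: a run is claimable whenever the
// in-memory registry no longer tracks it, even if run_owner_pod is this pod.
interface Run {
  threadId: string;
  status: string;
  runOwnerPod: string | null;
}

function claimOrphanedRun(
  run: Run,
  currentPod: string,
  registryIsRunning: (threadId: string) => boolean,
): boolean {
  if (run.status !== "in_progress") return false;
  // Callers already verify the registry; double-check defensively.
  if (registryIsRunning(run.threadId)) return false;
  // Before the fix this also bailed out when run.runOwnerPod === currentPod,
  // which made single-pod deploys permanently unrecoverable.
  run.runOwnerPod = currentPod;
  return true;
}
```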

Written for commit 6721ccf. Summary will update on new commits.

…nects

Adds 15s SSE keepalive comments to chat stream responses so reverse-proxy
read timeouts (nginx ingress, Istio, ALB) don't kill long deep-research
runs mid-flight. Fixes claimOrphanedRun excluding same-pod ownership,
which pinned threads in_progress forever on single-pod self-hosted deploys
when a run left the in-memory registry while run_owner_pod still pointed
at the live pod. Bumps MAX_RUN_AGE_MS 30→90min so deep research isn't
reaped while the provider job keeps burning credits, and clears the
chat-side error banner on silent resume / new send so transient blips
don't leave a sticky "network error" toast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot commented May 8, 2026

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction | Action
👍 | Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

github-actions Bot commented May 8, 2026

Release Options

Suggested: Patch (2.310.12) — based on fix: prefix

React with an emoji to override the release type:

Reaction | Type | Next Version
👍 | Prerelease | 2.310.12-alpha.1
🎉 | Patch | 2.310.12
❤️ | Minor | 2.311.0
🚀 | Major | 3.0.0

Current version: 2.310.11

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).

viktormarinho merged commit 3c16183 into main May 8, 2026
15 of 16 checks passed
viktormarinho deleted the viktormarinho/network-error-debug branch May 8, 2026 19:05
