Skip to content

fix(webapp): stop replica lag from double-triggering session runs and 404ing fresh sessions#3914

Open
ericallam wants to merge 2 commits into
mainfrom
fix/session-replica-races
Open

fix(webapp): stop replica lag from double-triggering session runs and 404ing fresh sessions#3914
ericallam wants to merge 2 commits into
mainfrom
fix/session-replica-races

Conversation

@ericallam

@ericallam ericallam commented Jun 12, 2026

Copy link
Copy Markdown
Member

Summary

Two read-replica races on the session APIs could break chats whose first activity lands inside the replication window (or any time the replica lags):

  1. A session's first .in append or .out subscribe could fail with a 404 for a session that exists on the writer, because the route resolved the Session row on the replica only.
  2. ensureRunForSession probed run liveness on the replica, so a probe miss on a run triggered moments earlier was judged "run is dead" and a second live run was spawned for the same session. Both runs then consumed the same input stream, producing duplicated turns and doubled responses (and doubled LLM cost).

Fix

Liveness now re-probes the writer before declaring the current run dead (the old code already fell back to the writer, but only to recover the friendlyId, after the wrong verdict was made). Session resolution on the append and subscribe/init routes goes through a new resolveSessionWithWriterFallback, which stays replica-first on the hot path and only touches the writer on a miss.

Reproduced and verified against a local streaming replica with an artificial apply delay: pre-fix, a send immediately after session creation reliably produced either the 404 or two executing runs with a doubled response; post-fix, the same flow produces exactly one run and one response.

Also rides along: the local docker replica's default apply delay drops from 150ms to a realistic 20ms (override via REPLICA_APPLY_DELAY when you want to deliberately widen the race window).

… 404ing fresh sessions

ensureRunForSession probed run liveness on the read replica, so a probe
miss on a just-triggered run was judged dead and a second live run was
spawned for the same session, duplicating turns and responses. The
append and subscribe/init session routes also resolved the Session row
on the replica only, failing a fresh session's first append or
subscribe inside the replication window. Liveness now re-probes the
writer before declaring a run dead, and session resolution on those
routes falls back to the writer on a replica miss.
@changeset-bot

changeset-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: c0ccb24

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3f9ab13a-0640-4504-a3c3-cb0ae8545441

📥 Commits

Reviewing files that changed from the base of the PR and between 8856bc6 and c0ccb24.

📒 Files selected for processing (1)
  • docker/docker-compose.yml
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: 🛡️ E2E Auth Tests (full)
🔇 Additional comments (2)
docker/docker-compose.yml (2)

61-62: LGTM!


76-76: LGTM!


Walkthrough

This PR fixes read-replica race conditions affecting session APIs by introducing a replica-first resolver with writer fallback and updating run liveness detection. The core change adds resolveSessionWithWriterFallback to session resolution logic, allowing realtime routes to tolerate replica lag when looking up sessions. Run probe logic is also enhanced to explicitly re-probe the writer when the replica misses, preventing incorrect "dead run" detection. The changes resolve two behavioral issues: initial session appends no longer return 404, and session runs created just before replica updates are no longer treated as dead, preventing duplicate run triggers and chat response duplication.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description covers the problem statement, implementation details, and verification approach, but is missing the required checklist items, testing steps section, and changelog section from the template. Add the missing template sections: checklist with contribution guide verification, explicit testing steps section, changelog section, and screenshots section if applicable.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main changes: fixing replica lag issues that cause duplicate session runs and 404 errors on fresh sessions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/session-replica-races

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

150ms is useful for deliberately shaking out replica races but is an
order of magnitude above typical streaming-replication lag; 20ms keeps
the local replica honest by default. Override via REPLICA_APPLY_DELAY.
@ericallam ericallam marked this pull request as ready for review June 12, 2026 11:36

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant