fix(webapp): stop replica lag from double-triggering session runs and 404ing fresh sessions #3914
ericallam wants to merge 2 commits
… 404ing fresh sessions

ensureRunForSession probed run liveness on the read replica, so a probe miss on a just-triggered run was judged dead and a second live run was spawned for the same session, duplicating turns and responses. The append and subscribe/init session routes also resolved the Session row on the replica only, failing a fresh session's first append or subscribe inside the replication window.

Liveness now re-probes the writer before declaring a run dead, and session resolution on those routes falls back to the writer on a replica miss.
Walkthrough
This PR fixes read-replica race conditions affecting session APIs by introducing a replica-first resolver with writer fallback and updating run liveness detection.
150ms is useful for deliberately shaking out replica races but is an order of magnitude above typical streaming-replication lag; 20ms keeps the local replica honest by default. Override via REPLICA_APPLY_DELAY.
Summary
Two read-replica races on the session APIs could break chats whose first activity lands inside the replication window (or any time the replica lags):
- `.in` append or `.out` subscribe could fail with a 404 for a session that exists on the writer, because the route resolved the Session row on the replica only.
- `ensureRunForSession` probed run liveness on the replica, so a probe miss on a run triggered moments earlier was judged "run is dead" and a second live run was spawned for the same session. Both runs then consumed the same input stream, producing duplicated turns and doubled responses (and doubled LLM cost).

Fix
Liveness now re-probes the writer before declaring the current run dead (the old code already fell back to the writer, but only to recover the friendlyId, after the wrong verdict was made). Session resolution on the append and subscribe/init routes goes through a new `resolveSessionWithWriterFallback`, which stays replica-first on the hot path and only touches the writer on a miss.

Reproduced and verified against a local streaming replica with an artificial apply delay: pre-fix, a send immediately after session creation reliably produced either the 404 or two executing runs with a doubled response; post-fix, the same flow produces exactly one run and one response.
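As an illustration of the two behaviours described under Fix, here is a minimal sketch of a replica-first read that falls back to the writer, plus a liveness check that consults the writer before calling a run dead. It is not the code from this PR: `Source`, `readWithWriterFallback`, `isRunAlive`, the `Run` shape, and the status names are all hypothetical stand-ins, and the lookups are plain async functions rather than real database clients.

```typescript
// Hypothetical sketch: none of these names or types come from the PR itself.
type Lookup<T> = (id: string) => Promise<T | null>;

interface Source<T> {
  replica: Lookup<T>; // may lag behind the writer
  writer: Lookup<T>; // authoritative, but more expensive to hit
}

// Replica-first read: the writer is queried only when the replica misses,
// so the hot path (row already replicated) never touches the writer.
async function readWithWriterFallback<T>(
  source: Source<T>,
  id: string
): Promise<T | null> {
  const fromReplica = await source.replica(id);
  if (fromReplica !== null) return fromReplica;
  return source.writer(id);
}

// Assumed status values, for illustration only.
type RunStatus = "EXECUTING" | "COMPLETED";

interface Run {
  id: string;
  status: RunStatus;
}

function isLive(run: Run | null): boolean {
  return run !== null && run.status === "EXECUTING";
}

// A "dead" verdict is reached only after the writer has been consulted.
// A live answer from the replica is accepted without a second query.
async function isRunAlive(runs: Source<Run>, runId: string): Promise<boolean> {
  if (isLive(await runs.replica(runId))) return true;
  return isLive(await runs.writer(runId));
}
```

With a replica that has not yet applied a fresh insert, `readWithWriterFallback` returns the writer's row instead of `null`, and `isRunAlive` reports the run as live instead of prompting a second run. The cost is one extra writer query on every replica miss, including misses for ids that do not exist anywhere.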
Also rides along: the local docker replica's default apply delay drops from 150ms to a realistic 20ms (override via `REPLICA_APPLY_DELAY` when you want to deliberately widen the race window).