[pull] main from triggerdotdev:main#208
Merged
…when the read replica lags (#3889)

## Summary

When `RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED` is on, `RunEngine.getSnapshotsSince` reads from the read replica. During write spikes the replica can briefly lag, so the snapshot id a runner just learned from the writer isn't visible there yet: the lookup threw, the worker route returned a 500, and the runner waited for its next poll, turning sub-second snapshot notifications into poll-interval latency exactly when things are busiest.

This PR makes the flag safe to enable: a replica miss of the since snapshot gets one jittered retry on the replica (most lag windows are shorter than the ~50–200ms wait, so the writer is never touched), then falls back to the primary, observed via a new `run_engine.snapshots_since.replica_miss` counter with an `outcome` attribute (`replica_retry` vs `primary`). Only genuine misses (absent on the primary too) remain errors.

## Design

- `getExecutionSnapshotsSince` now throws a typed `ExecutionSnapshotNotFoundError` so the engine can distinguish the expected lag miss from real failures. The message string is unchanged and the error never leaves the engine.
- The recovery path only engages when the flag is on, a distinct replica client is configured, and no transaction client was passed. With the flag off, the path is behaviorally identical to before.
- Retry delay bounds are configurable (`RUN_ENGINE_SNAPSHOTS_SINCE_REPLICA_RETRY_MIN_MS`/`MAX_MS`, default 50/200; `MAX_MS=0` skips the replica retry and goes straight to the primary).
- The warn log fires only when the primary serves the read (the writer spill is the operationally interesting event); replica-retry recoveries are counted but quiet. A permanently-missing snapshot id stays an error-level failure with a `failedDuring` field, so lag metrics aren't polluted by bogus ids.
- Stale-tail lag (replica has the since snapshot but not newer rows) deliberately still returns the replica's view; the next poll catches up.
- The since-snapshot anchor lookup is now scoped to the polled run (`where: { id, runId }`), so a snapshot id from a different run raises not-found instead of silently anchoring a too-wide window of the run's snapshots.

## Test plan

All vitest + testcontainers, no mocks. A new `schemaOnlyPrisma` fixture (migrated-but-empty clone database) simulates a replica that hasn't caught up, and a real in-memory OTel meter pins the counter semantics per outcome.

- [x] Replica catches up during the jittered retry window → served by the replica, `outcome=replica_retry` = 1, primary never consulted
- [x] Replica permanently missing the since snapshot → served by the primary, `outcome=primary` = 1
- [x] Snapshot missing on both replica and primary → null, counter = 0
- [x] Replica has the since snapshot but lags by one → the replica's view is served, no fallback (verified discriminating power: the test fails if reads secretly hit the primary)
- [x] Flag off with a replica configured → primary serves the read
- [x] Transaction client provided → bypasses the replica entirely
- [x] Since snapshot belonging to a different run → null
- [x] Existing getSnapshotsSince + waitpoints suites green; run-engine, testcontainers, and webapp typechecks pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
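
The recovery flow described in the summary (replica read, one jittered retry on the replica, then primary fallback, with a counter keyed by outcome) can be sketched roughly as below. This is a minimal illustration, not the run engine's code: `SnapshotNotFoundError`, `readWithReplicaFallback`, `jitteredDelay`, and every option name are hypothetical stand-ins, the `onReplicaMiss` callback stands in for the `run_engine.snapshots_since.replica_miss` counter, and the flag, transaction-client, and distinct-replica checks are assumed to have happened before this function is called.

```typescript
// Hypothetical sketch: none of these names come from the run engine itself.
class SnapshotNotFoundError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SnapshotNotFoundError";
    // Keeps `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, SnapshotNotFoundError.prototype);
  }
}

type ReplicaMissOutcome = "replica_retry" | "primary";

interface ReplicaFallbackOptions<T> {
  readReplica: () => Promise<T>; // rejects with SnapshotNotFoundError on a miss
  readPrimary: () => Promise<T>; // rejects with SnapshotNotFoundError on a genuine miss
  retryMinMs: number; // lower bound of the jittered wait
  retryMaxMs: number; // upper bound; 0 skips the replica retry
  onReplicaMiss: (outcome: ReplicaMissOutcome) => void; // stand-in for the counter
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number; // injectable for tests
}

// Uniform jitter in [minMs, maxMs]; collapses to minMs if the bounds are inverted.
function jitteredDelay(
  minMs: number,
  maxMs: number,
  random: () => number = Math.random
): number {
  return minMs + random() * Math.max(0, maxMs - minMs);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function readWithReplicaFallback<T>(
  opts: ReplicaFallbackOptions<T>
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;

  // First attempt: the replica. Only the typed not-found error is treated as lag.
  try {
    return await opts.readReplica();
  } catch (err) {
    if (!(err instanceof SnapshotNotFoundError)) throw err;
  }

  // One jittered retry on the replica, unless the upper bound disables it.
  if (opts.retryMaxMs > 0) {
    await sleep(jitteredDelay(opts.retryMinMs, opts.retryMaxMs, opts.random));
    try {
      const rows = await opts.readReplica();
      opts.onReplicaMiss("replica_retry");
      return rows;
    } catch (err) {
      if (!(err instanceof SnapshotNotFoundError)) throw err;
    }
  }

  // Fall back to the primary. A miss here is genuine: it propagates to the
  // caller and is not counted, so lag metrics stay clean.
  const rows = await opts.readPrimary();
  opts.onReplicaMiss("primary");
  return rows;
}
```

In this sketch the counter is only touched after a read succeeds, which mirrors the test-plan expectation that a snapshot missing on both databases leaves the counter at 0. How the caller turns the propagated error into a response is outside the sketch.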
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)