Fix stale page module cache poisoning#5094
Conversation
Reproduces the staging incident where a prerender pool page running an old host bundle wrote a module error into the public modules cache, 500ing every anonymous card read; error-cache revalidation stayed pinned to the same page so the realm could not recover until a render landed on a fresh page. The stale page is simulated by a Prerenderer wrapper around the real test prerender server, so the cache write (cache_scope=public), the serve-time 500 via FilterRefersToNonexistentTypeError, the TTL-based revalidation, and the recovery path all run through the real stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Host Test Results 1 files ±0 1 suites ±0 1h 52m 5s ⏱️ - 1m 9s Results for commit 7e83702. ± Comparison against earlier commit f5f165f. Realm Server Test Results 1 files ± 0 1 suites ±0 11m 40s ⏱️ -53s Results for commit 7e83702. ± Comparison against earlier commit f5f165f. |
…e-poison-cs-11355
Models the routing behavior behind the staging incident: a prerender pool page whose loaded host bundle is stale fails every module render routed to it, and realm affinity keeps sending the realm's renders back to that same page. A real stale page differs from its peers only by the bytes already in its browser context, so a __test_poisonPage seam on PagePool (exposed through Prerenderer) injects that per-page failure by pageId; the module render path synthesizes the same module-error a genuine in-page import failure produces. The registry is never populated in production. The passing test drives the real maxPages:2 pool and shows the realm affinity stays pinned to the poisoned page across revalidations, then recovers on that same page once the bundle is fresh. A QUnit.todo captures the desired post-fix behavior — revalidation recovering via a healthy page while one page is stale — which fails today because every same-affinity render reuses the pinned page. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… pin Turn the desired-recovery case from a QUnit todo into a hard assertion: with a healthy page available, a revalidation should recover on its own. Today every same-affinity render reuses the pinned (stale) page, so all revalidations error and this fails — demonstrating the bug ahead of the fix. The assertion checks only that a revalidation comes back ready, so it is satisfied whether the fix recycles the stale page or routes the retry to a healthy one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a module render errors on a reused page that was not evicted, the prerenderer now recycles the affinity's page — dropping the shared context so the replacement reloads the host bundle cold — and retries once on a fresh page, instead of returning the error for caching. This keeps a page running a stale host bundle (an export the bundle predates) from pinning the realm affinity and reproducing the same module error on every revalidation. Auth failures and already-evicted errors (render timeouts, dead-runloop desyncs) are skipped, and the recycle is bounded to one per call so a genuinely broken module still errors on the fresh page and is cached as before. Updates the affinity test: the revalidation now recovers on a fresh page rather than staying pinned, and the previously-failing recovery test passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This reverts commit ec53614.
Adds a `freshPage` hint to module prerendering. When definition-lookup re-renders a module whose cached entry is an expired error (error-cache revalidation), it sets the hint; the prerenderer then serves that render on a fresh page (commandeering a standby) instead of reusing the affinity's warm tab. This keeps a pool page running a stale host bundle from pinning the realm affinity and reproducing the same module error on every revalidation. The hint is plumbed through prerenderModule → prerenderModuleAttempt → PagePool.getPage and forwarded over the remote/HTTP path (remote-prerenderer + prerender-app). PagePool prefers a standby — and warms one if the pool has room — rather than disposing the affinity, so in-flight renders on the warm tab are untouched. Renders that don't set the hint (first renders, pre-warm, all indexing) behave exactly as before, so the indexing path is unaffected. The affinity test now passes the hint on the revalidation and asserts it recovers on a different page; a companion test confirms that without the hint a reused page is still reused (default behavior preserved). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The freshPage path gated its standby-warm await on currentStandbyCount() < desiredStandbyCount(). A standby mid-creation inflates currentStandbyCount (size + creating) to meet desired while #standbys.size is still 0, so the guard skipped the await, found no ready standby, and fell through to reusing the stale tab — the same trap the deadlock-escape path below already documents. Await #ensureStandbyPool unconditionally when no standby is free (it returns the in-flight refill if one is creating, and no-ops when the pool has no room) so the revalidation reliably lands on a fresh page. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9a2dc63540
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The freshPage path commandeered a fresh standby but left the affinity's existing warm tab in the pool. Since `#selectLRUTab` prefers the oldest `lastUsedAt`, the next un-hinted render in that realm would pick the still-idle stale tab and reproduce the stale-bundle module error on a different module's cache entry — recovering the one errored module while poisoning others. On freshPage, retire the affinity's IDLE tabs (close them and drop them from `#affinityPages`) before serving the render on a fresh standby, so a possibly-stale tab can't be re-handed to a later render. Only idle tabs are retired — in-flight renders on busy tabs are left to finish, so this doesn't tear down active work the way disposing the whole affinity would. Extends the affinity test: after a freshPage recovery, a later un-hinted render must not be routed back to the retired stale page. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-poison-cs-11355
There was a problem hiding this comment.
Pull request overview
This PR addresses a production-risk failure mode where a prerender pool page running a stale host bundle can repeatedly reproduce an “expired cached module error” during revalidation, re-poisoning the public module cache and preventing recovery. The core fix adds a freshPage hint for error-cache revalidations so the prerenderer can bypass the affinity’s warm tab and fetch module definitions on a different (fresh) page.
Changes:
- Added
freshPage?: booleantoModulePrerenderArgs, and set it during cached expired error revalidation inCachingDefinitionLookup. - Plumbed
freshPagethrough the prerender HTTP route and remote prerenderer to the prerender pool, where it can retire idle affinity tabs and attempt to serve the render on a standby page. - Added realm-server tests covering both the stale-page affinity routing behavior and the public module-cache poisoning scenario.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/runtime-common/index.ts | Extends prerender module request contract with freshPage hint for error revalidation. |
| packages/runtime-common/definition-lookup.ts | Detects expired cached error revalidation and requests prerenders using freshPage. |
| packages/realm-server/prerender/render-runner.ts | Threads freshPage into tab acquisition and adds a test seam for simulating stale-bundle module errors. |
| packages/realm-server/prerender/remote-prerenderer.ts | Forwards freshPage over the remote prerender JSON:API request. |
| packages/realm-server/prerender/prerenderer.ts | Accepts and forwards freshPage to the render-runner attempt path. |
| packages/realm-server/prerender/prerender-app.ts | Parses freshPage from request attributes and forwards it to prerenderModule. |
| packages/realm-server/prerender/page-pool.ts | Implements the freshPage behavior (retire idle affinity tabs + prefer standby) and adds test-only poisoning hooks. |
| packages/realm-server/tests/prerender-stale-page-affinity-test.ts | New test that exercises escaping a pinned stale page via freshPage. |
| packages/realm-server/tests/module-cache-poison-test.ts | New test reproducing stale-page poisoning of the public modules cache and TTL revalidation behavior. |
| packages/realm-server/tests/index.ts | Registers the new tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (standby) { | ||
| let releaseTab = await standby.queue.acquire(signal, priority); | ||
| return { entry: standby, reused: false, releaseTab, tabStartupMs }; | ||
| } |
| if (freshPage) { | ||
| this.#retireIdleAffinityTabs(affinityKey); | ||
| // Recompute: the tabs just marked `closing` must not be eligible for | ||
| // the fall-through selection below. | ||
| entries = this.#affinityPages.get(affinityKey); |
|
if we believe that the prerender server is encountering this error because the cached host page is out of date, then i feel like the approach should be different. It should be the responsibility of the prerender server to dump it's host instead of teh callers of the server to own that decision. I would prefer to see the prerender manager send a signal to all the prerender servers that the host has been recycled and to dump all tabs, as opposed to each caller probing for a bad render server. this means that the host deployment CI should send a signal to the prerender manager that the host has been redeployed, and then the prerender manager can send a signal to all the prerender servers to dump all their tabs as soon as all the in progress work settles. if that doesn't actually solve this issue, then that means we don't actually understand what the problem is. this new option totally subverts the prerender caching and it feels like a bandaid, not a solution. |
👍🏻 thanks, I definitely am out of my depth here, so I appreciate the insight! I’ll close this and try the signal approach. |
|
@backspace I think an even simpler way to fix this if indeed the old host was being cached would be do make a slight adjustment to the deploy script. right now the host app and prerender server deploy in parallel. but if we made the prerenderer only deploy after the host app is deployed, then when the prerender servers are restarted they will pick up the new host app. |
The Checkly rehydration check has failed periodically in staging. The SSR version looks fine, but when Ember
hosttakes over, a runtime error shows:Claude’s analysis of why:
So far this hasn’t happened in production but it could.
The fix: a
freshPageoption when requesting a prerendering page, used when definition lookup is rerendering a module that has a cached expired error.freshPagebypasses the warmed tab so the new module definition can be fetched.There’s a new realm server test, failing here before the fix was implemented.