Skip to content

Fix stale page module cache poisoning#5094

Closed
backspace wants to merge 10 commits into
mainfrom
prerender-module-cache-poison-cs-11355
Closed

Fix stale page module cache poisoning#5094
backspace wants to merge 10 commits into
mainfrom
prerender-module-cache-poison-cs-11355

Conversation

@backspace

@backspace backspace commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

The Checkly rehydration check has failed periodically in staging. The SSR version looks fine, but when Ember host takes over, a runtime error shows:

CleanShot 2026-06-09 at 08 38 14@2x

Claude’s analysis of why:

  1. A host bundle change that adds/removes a runtime-common export.
  2. Base-realm card-api.gts that imports the new export, deployed and indexed.
  3. A prerender pool page that survives the deploy still running the old bundle, pinned by realm
    affinity to the base realm.
  4. The immutable-index.html caching that lets a freshly-booted prerender pick up a stale shell
    (remediation item 1).
  5. Anonymous traffic to a published realm whose index card resolves the affected type, hitting the
    poisoned public cache.

So far this hasn’t happened in production but it could.

The fix: a freshPage option when requesting a prerendering page, used when definition lookup is rerendering a module that has a cached expired error. freshPage bypasses the warmed tab so the new module definition can be fetched.

There’s a new realm server test, failing here before the fix was implemented.

Reproduces the staging incident where a prerender pool page running an
old host bundle wrote a module error into the public modules cache,
500ing every anonymous card read; error-cache revalidation stayed
pinned to the same page so the realm could not recover until a render
landed on a fresh page.

The stale page is simulated by a Prerenderer wrapper around the real
test prerender server, so the cache write (cache_scope=public), the
serve-time 500 via FilterRefersToNonexistentTypeError, the TTL-based
revalidation, and the recovery path all run through the real stack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Host Test Results

    1 files  ±0      1 suites  ±0   1h 52m 5s ⏱️ - 1m 9s
3 033 tests ±0  3 018 ✅ ±0  15 💤 ±0  0 ❌ ±0 
3 052 runs  ±0  3 037 ✅ ±0  15 💤 ±0  0 ❌ ±0 

Results for commit 7e83702. ± Comparison against earlier commit f5f165f.

Realm Server Test Results

    1 files  ±    0      1 suites  ±0   11m 40s ⏱️ -53s
1 618 tests +    2  1 618 ✅ +    3  0 💤 ±0  0 ❌  - 1 
1 709 runs   - 1 705  1 709 ✅  - 1 704  0 💤 ±0  0 ❌  - 1 

Results for commit 7e83702. ± Comparison against earlier commit f5f165f.

backspace and others added 7 commits June 4, 2026 07:22
Models the routing behavior behind the staging incident: a prerender
pool page whose loaded host bundle is stale fails every module render
routed to it, and realm affinity keeps sending the realm's renders back
to that same page. A real stale page differs from its peers only by the
bytes already in its browser context, so a __test_poisonPage seam on
PagePool (exposed through Prerenderer) injects that per-page failure by
pageId; the module render path synthesizes the same module-error a
genuine in-page import failure produces. The registry is never populated
in production.

The passing test drives the real maxPages:2 pool and shows the realm
affinity stays pinned to the poisoned page across revalidations, then
recovers on that same page once the bundle is fresh. A QUnit.todo
captures the desired post-fix behavior — revalidation recovering via a
healthy page while one page is stale — which fails today because every
same-affinity render reuses the pinned page.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… pin

Turn the desired-recovery case from a QUnit todo into a hard assertion:
with a healthy page available, a revalidation should recover on its own.
Today every same-affinity render reuses the pinned (stale) page, so all
revalidations error and this fails — demonstrating the bug ahead of the
fix. The assertion checks only that a revalidation comes back ready, so
it is satisfied whether the fix recycles the stale page or routes the
retry to a healthy one.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a module render errors on a reused page that was not evicted, the
prerenderer now recycles the affinity's page — dropping the shared
context so the replacement reloads the host bundle cold — and retries
once on a fresh page, instead of returning the error for caching. This
keeps a page running a stale host bundle (an export the bundle predates)
from pinning the realm affinity and reproducing the same module error on
every revalidation. Auth failures and already-evicted errors (render
timeouts, dead-runloop desyncs) are skipped, and the recycle is bounded
to one per call so a genuinely broken module still errors on the fresh
page and is cached as before.

Updates the affinity test: the revalidation now recovers on a fresh page
rather than staying pinned, and the previously-failing recovery test
passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a `freshPage` hint to module prerendering. When definition-lookup
re-renders a module whose cached entry is an expired error (error-cache
revalidation), it sets the hint; the prerenderer then serves that render
on a fresh page (commandeering a standby) instead of reusing the
affinity's warm tab. This keeps a pool page running a stale host bundle
from pinning the realm affinity and reproducing the same module error on
every revalidation.

The hint is plumbed through prerenderModule → prerenderModuleAttempt →
PagePool.getPage and forwarded over the remote/HTTP path
(remote-prerenderer + prerender-app). PagePool prefers a standby — and
warms one if the pool has room — rather than disposing the affinity, so
in-flight renders on the warm tab are untouched. Renders that don't set
the hint (first renders, pre-warm, all indexing) behave exactly as
before, so the indexing path is unaffected.

The affinity test now passes the hint on the revalidation and asserts it
recovers on a different page; a companion test confirms that without the
hint a reused page is still reused (default behavior preserved).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The freshPage path gated its standby-warm await on
currentStandbyCount() < desiredStandbyCount(). A standby mid-creation
inflates currentStandbyCount (size + creating) to meet desired while
#standbys.size is still 0, so the guard skipped the await, found no
ready standby, and fell through to reusing the stale tab — the same
trap the deadlock-escape path below already documents. Await
#ensureStandbyPool unconditionally when no standby is free (it returns
the in-flight refill if one is creating, and no-ops when the pool has
no room) so the revalidation reliably lands on a fresh page.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@backspace backspace marked this pull request as ready for review June 10, 2026 18:24

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a2dc63540

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/realm-server/prerender/page-pool.ts
backspace and others added 2 commits June 10, 2026 14:27
The freshPage path commandeered a fresh standby but left the affinity's
existing warm tab in the pool. Since `#selectLRUTab` prefers the oldest
`lastUsedAt`, the next un-hinted render in that realm would pick the
still-idle stale tab and reproduce the stale-bundle module error on a
different module's cache entry — recovering the one errored module while
poisoning others.

On freshPage, retire the affinity's IDLE tabs (close them and drop them
from `#affinityPages`) before serving the render on a fresh standby, so a
possibly-stale tab can't be re-handed to a later render. Only idle tabs
are retired — in-flight renders on busy tabs are left to finish, so this
doesn't tear down active work the way disposing the whole affinity would.

Extends the affinity test: after a freshPage recovery, a later un-hinted
render must not be routed back to the retired stale page.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@backspace backspace requested a review from a team June 11, 2026 12:34
@habdelra habdelra requested a review from Copilot June 11, 2026 13:11

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a production-risk failure mode where a prerender pool page running a stale host bundle can repeatedly reproduce an “expired cached module error” during revalidation, re-poisoning the public module cache and preventing recovery. The core fix adds a freshPage hint for error-cache revalidations so the prerenderer can bypass the affinity’s warm tab and fetch module definitions on a different (fresh) page.

Changes:

  • Added freshPage?: boolean to ModulePrerenderArgs, and set it during cached expired error revalidation in CachingDefinitionLookup.
  • Plumbed freshPage through the prerender HTTP route and remote prerenderer to the prerender pool, where it can retire idle affinity tabs and attempt to serve the render on a standby page.
  • Added realm-server tests covering both the stale-page affinity routing behavior and the public module-cache poisoning scenario.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
packages/runtime-common/index.ts Extends prerender module request contract with freshPage hint for error revalidation.
packages/runtime-common/definition-lookup.ts Detects expired cached error revalidation and requests prerenders using freshPage.
packages/realm-server/prerender/render-runner.ts Threads freshPage into tab acquisition and adds a test seam for simulating stale-bundle module errors.
packages/realm-server/prerender/remote-prerenderer.ts Forwards freshPage over the remote prerender JSON:API request.
packages/realm-server/prerender/prerenderer.ts Accepts and forwards freshPage to the render-runner attempt path.
packages/realm-server/prerender/prerender-app.ts Parses freshPage from request attributes and forwards it to prerenderModule.
packages/realm-server/prerender/page-pool.ts Implements the freshPage behavior (retire idle affinity tabs + prefer standby) and adds test-only poisoning hooks.
packages/realm-server/tests/prerender-stale-page-affinity-test.ts New test that exercises escaping a pinned stale page via freshPage.
packages/realm-server/tests/module-cache-poison-test.ts New test reproducing stale-page poisoning of the public modules cache and TTL revalidation behavior.
packages/realm-server/tests/index.ts Registers the new tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1966 to +1969
if (standby) {
let releaseTab = await standby.queue.acquire(signal, priority);
return { entry: standby, reused: false, releaseTab, tabStartupMs };
}
Comment on lines +1940 to +1944
if (freshPage) {
this.#retireIdleAffinityTabs(affinityKey);
// Recompute: the tabs just marked `closing` must not be eligible for
// the fall-through selection below.
entries = this.#affinityPages.get(affinityKey);
@habdelra

habdelra commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

if we believe that the prerender server is encountering this error because the cached host page is out of date, then i feel like the approach should be different. It should be the responsibility of the prerender server to dump it's host instead of teh callers of the server to own that decision. I would prefer to see the prerender manager send a signal to all the prerender servers that the host has been recycled and to dump all tabs, as opposed to each caller probing for a bad render server. this means that the host deployment CI should send a signal to the prerender manager that the host has been redeployed, and then the prerender manager can send a signal to all the prerender servers to dump all their tabs as soon as all the in progress work settles.

if that doesn't actually solve this issue, then that means we don't actually understand what the problem is. this new option totally subverts the prerender caching and it feels like a bandaid, not a solution.

@backspace

Copy link
Copy Markdown
Contributor Author

if we believe that the prerender server is encountering this error because the cached host page is out of date, then i feel like the approach should be different. It should be the responsibility of the prerender server to dump it's host instead of teh callers of the server to own that decision. I would prefer to see the prerender manager send a signal to all the prerender servers that the host has been recycled and to dump all tabs, as opposed to each caller probing for a bad render server. this means that the host deployment CI should send a signal to the prerender manager that the host has been redeployed, and then the prerender manager can send a signal to all the prerender servers to dump all their tabs as soon as all the in progress work settles.

if that doesn't actually solve this issue, then that means we don't actually understand what the problem is. this new option totally subverts the prerender caching and it feels like a bandaid, not a solution.

👍🏻 thanks, I definitely am out of my depth here, so I appreciate the insight! I’ll close this and try the signal approach.

@habdelra

habdelra commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

@backspace I think an even simpler way to fix this if indeed the old host was being cached would be do make a slight adjustment to the deploy script. right now the host app and prerender server deploy in parallel. but if we made the prerenderer only deploy after the host app is deployed, then when the prerender servers are restarted they will pick up the new host app.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants