Fix stale page module cache poisoning by backspace · Pull Request #5094 · cardstack/boxel

backspace · 2026-06-03T19:36:20Z

The Checkly rehydration check has failed periodically in staging. The SSR version looks fine, but when Ember host takes over, a runtime error shows:

Claude’s analysis of why:

A host bundle change that adds/removes a runtime-common export.

Base-realm card-api.gts that imports the new export, deployed and indexed.

A prerender pool page that survives the deploy still running the old bundle, pinned by realm
affinity to the base realm.

The immutable-index.html caching that lets a freshly-booted prerender pick up a stale shell
(remediation item 1).

Anonymous traffic to a published realm whose index card resolves the affected type, hitting the
poisoned public cache.

So far this hasn’t happened in production but it could.

The fix: a freshPage option when requesting a prerendering page, used when definition lookup is rerendering a module that has a cached expired error. freshPage bypasses the warmed tab so the new module definition can be fetched.

There’s a new realm server test, failing here before the fix was implemented.

Reproduces the staging incident where a prerender pool page running an old host bundle wrote a module error into the public modules cache, 500ing every anonymous card read; error-cache revalidation stayed pinned to the same page so the realm could not recover until a render landed on a fresh page. The stale page is simulated by a Prerenderer wrapper around the real test prerender server, so the cache write (cache_scope=public), the serve-time 500 via FilterRefersToNonexistentTypeError, the TTL-based revalidation, and the recovery path all run through the real stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-03T20:08:53Z

Host Test Results

1 files ±0 1 suites ±0 1h 52m 5s ⏱️ - 1m 9s
3 033 tests ±0 3 018 ✅ ±0 15 💤 ±0 0 ❌ ±0
3 052 runs ±0 3 037 ✅ ±0 15 💤 ±0 0 ❌ ±0

Results for commit 7e83702. ± Comparison against earlier commit f5f165f.

Realm Server Test Results

1 files ± 0 1 suites ±0 11m 40s ⏱️ -53s
1 618 tests + 2 1 618 ✅ + 3 0 💤 ±0 0 ❌ - 1
1 709 runs - 1 705 1 709 ✅ - 1 704 0 💤 ±0 0 ❌ - 1

Results for commit 7e83702. ± Comparison against earlier commit f5f165f.

…e-poison-cs-11355

Models the routing behavior behind the staging incident: a prerender pool page whose loaded host bundle is stale fails every module render routed to it, and realm affinity keeps sending the realm's renders back to that same page. A real stale page differs from its peers only by the bytes already in its browser context, so a __test_poisonPage seam on PagePool (exposed through Prerenderer) injects that per-page failure by pageId; the module render path synthesizes the same module-error a genuine in-page import failure produces. The registry is never populated in production. The passing test drives the real maxPages:2 pool and shows the realm affinity stays pinned to the poisoned page across revalidations, then recovers on that same page once the bundle is fresh. A QUnit.todo captures the desired post-fix behavior — revalidation recovering via a healthy page while one page is stale — which fails today because every same-affinity render reuses the pinned page. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… pin Turn the desired-recovery case from a QUnit todo into a hard assertion: with a healthy page available, a revalidation should recover on its own. Today every same-affinity render reuses the pinned (stale) page, so all revalidations error and this fails — demonstrating the bug ahead of the fix. The assertion checks only that a revalidation comes back ready, so it is satisfied whether the fix recycles the stale page or routes the retry to a healthy one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When a module render errors on a reused page that was not evicted, the prerenderer now recycles the affinity's page — dropping the shared context so the replacement reloads the host bundle cold — and retries once on a fresh page, instead of returning the error for caching. This keeps a page running a stale host bundle (an export the bundle predates) from pinning the realm affinity and reproducing the same module error on every revalidation. Auth failures and already-evicted errors (render timeouts, dead-runloop desyncs) are skipped, and the recycle is bounded to one per call so a genuinely broken module still errors on the fresh page and is cached as before. Updates the affinity test: the revalidation now recovers on a fresh page rather than staying pinned, and the previously-failing recovery test passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

This reverts commit ec53614.

Adds a `freshPage` hint to module prerendering. When definition-lookup re-renders a module whose cached entry is an expired error (error-cache revalidation), it sets the hint; the prerenderer then serves that render on a fresh page (commandeering a standby) instead of reusing the affinity's warm tab. This keeps a pool page running a stale host bundle from pinning the realm affinity and reproducing the same module error on every revalidation. The hint is plumbed through prerenderModule → prerenderModuleAttempt → PagePool.getPage and forwarded over the remote/HTTP path (remote-prerenderer + prerender-app). PagePool prefers a standby — and warms one if the pool has room — rather than disposing the affinity, so in-flight renders on the warm tab are untouched. Renders that don't set the hint (first renders, pre-warm, all indexing) behave exactly as before, so the indexing path is unaffected. The affinity test now passes the hint on the revalidation and asserts it recovers on a different page; a companion test confirms that without the hint a reused page is still reused (default behavior preserved). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The freshPage path gated its standby-warm await on currentStandbyCount() < desiredStandbyCount(). A standby mid-creation inflates currentStandbyCount (size + creating) to meet desired while #standbys.size is still 0, so the guard skipped the await, found no ready standby, and fell through to reusing the stale tab — the same trap the deadlock-escape path below already documents. Await #ensureStandbyPool unconditionally when no standby is free (it returns the in-flight refill if one is creating, and no-ops when the pool has no room) so the revalidation reliably lands on a fresh page. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a2dc63540

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The freshPage path commandeered a fresh standby but left the affinity's existing warm tab in the pool. Since `#selectLRUTab` prefers the oldest `lastUsedAt`, the next un-hinted render in that realm would pick the still-idle stale tab and reproduce the stale-bundle module error on a different module's cache entry — recovering the one errored module while poisoning others. On freshPage, retire the affinity's IDLE tabs (close them and drop them from `#affinityPages`) before serving the render on a fresh standby, so a possibly-stale tab can't be re-handed to a later render. Only idle tabs are retired — in-flight renders on busy tabs are left to finish, so this doesn't tear down active work the way disposing the whole affinity would. Extends the affinity test: after a freshPage recovery, a later un-hinted render must not be routed back to the retired stale page. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e-poison-cs-11355

Copilot

Pull request overview

This PR addresses a production-risk failure mode where a prerender pool page running a stale host bundle can repeatedly reproduce an “expired cached module error” during revalidation, re-poisoning the public module cache and preventing recovery. The core fix adds a freshPage hint for error-cache revalidations so the prerenderer can bypass the affinity’s warm tab and fetch module definitions on a different (fresh) page.

Changes:

Added freshPage?: boolean to ModulePrerenderArgs, and set it during cached expired error revalidation in CachingDefinitionLookup.
Plumbed freshPage through the prerender HTTP route and remote prerenderer to the prerender pool, where it can retire idle affinity tabs and attempt to serve the render on a standby page.
Added realm-server tests covering both the stale-page affinity routing behavior and the public module-cache poisoning scenario.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
packages/runtime-common/index.ts	Extends prerender module request contract with `freshPage` hint for error revalidation.
packages/runtime-common/definition-lookup.ts	Detects expired cached error revalidation and requests prerenders using `freshPage`.
packages/realm-server/prerender/render-runner.ts	Threads `freshPage` into tab acquisition and adds a test seam for simulating stale-bundle module errors.
packages/realm-server/prerender/remote-prerenderer.ts	Forwards `freshPage` over the remote prerender JSON:API request.
packages/realm-server/prerender/prerenderer.ts	Accepts and forwards `freshPage` to the render-runner attempt path.
packages/realm-server/prerender/prerender-app.ts	Parses `freshPage` from request attributes and forwards it to `prerenderModule`.
packages/realm-server/prerender/page-pool.ts	Implements the `freshPage` behavior (retire idle affinity tabs + prefer standby) and adds test-only poisoning hooks.
packages/realm-server/tests/prerender-stale-page-affinity-test.ts	New test that exercises escaping a pinned stale page via `freshPage`.
packages/realm-server/tests/module-cache-poison-test.ts	New test reproducing stale-page poisoning of the public modules cache and TTL revalidation behavior.
packages/realm-server/tests/index.ts	Registers the new tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+      if (standby) {
+        let releaseTab = await standby.queue.acquire(signal, priority);
+        return { entry: standby, reused: false, releaseTab, tabStartupMs };
+      }


+    if (freshPage) {
+      this.#retireIdleAffinityTabs(affinityKey);
+      // Recompute: the tabs just marked `closing` must not be eligible for
+      // the fall-through selection below.
+      entries = this.#affinityPages.get(affinityKey);


habdelra · 2026-06-11T13:22:06Z

if we believe that the prerender server is encountering this error because the cached host page is out of date, then i feel like the approach should be different. It should be the responsibility of the prerender server to dump it's host instead of teh callers of the server to own that decision. I would prefer to see the prerender manager send a signal to all the prerender servers that the host has been recycled and to dump all tabs, as opposed to each caller probing for a bad render server. this means that the host deployment CI should send a signal to the prerender manager that the host has been redeployed, and then the prerender manager can send a signal to all the prerender servers to dump all their tabs as soon as all the in progress work settles.

if that doesn't actually solve this issue, then that means we don't actually understand what the problem is. this new option totally subverts the prerender caching and it feels like a bandaid, not a solution.

backspace · 2026-06-11T14:03:49Z

if we believe that the prerender server is encountering this error because the cached host page is out of date, then i feel like the approach should be different. It should be the responsibility of the prerender server to dump it's host instead of teh callers of the server to own that decision. I would prefer to see the prerender manager send a signal to all the prerender servers that the host has been recycled and to dump all tabs, as opposed to each caller probing for a bad render server. this means that the host deployment CI should send a signal to the prerender manager that the host has been redeployed, and then the prerender manager can send a signal to all the prerender servers to dump all their tabs as soon as all the in progress work settles.

if that doesn't actually solve this issue, then that means we don't actually understand what the problem is. this new option totally subverts the prerender caching and it feels like a bandaid, not a solution.

👍🏻 thanks, I definitely am out of my depth here, so I appreciate the insight! I’ll close this and try the signal approach.

habdelra · 2026-06-11T15:53:15Z

@backspace I think an even simpler way to fix this if indeed the old host was being cached would be do make a slight adjustment to the deploy script. right now the host app and prerender server deploy in parallel. but if we made the prerenderer only deploy after the host app is deployed, then when the prerender servers are restarted they will pick up the new host app.

backspace and others added 7 commits June 4, 2026 07:22

Merge remote-tracking branch 'origin/main' into prerender-module-cach…

f12d58e

…e-poison-cs-11355

Revert "Recycle a stale pool page on a reused-page module error"

e7eb26c

This reverts commit ec53614.

backspace mentioned this pull request Jun 9, 2026

Fix ember-cli-deploy deprecations #5175

Merged

backspace marked this pull request as ready for review June 10, 2026 18:24

chatgpt-codex-connector Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread packages/realm-server/prerender/page-pool.ts

backspace and others added 2 commits June 10, 2026 14:27

Merge remote-tracking branch 'origin/main' into prerender-module-cach…

7e83702

…e-poison-cs-11355

backspace requested a review from a team June 11, 2026 12:34

habdelra requested a review from Copilot June 11, 2026 13:11

Copilot started reviewing on behalf of habdelra June 11, 2026 13:11 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

backspace closed this Jun 11, 2026

backspace mentioned this pull request Jun 11, 2026

Add recycle signal to prerenderers on host update #5205

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix stale page module cache poisoning#5094

Fix stale page module cache poisoning#5094
backspace wants to merge 10 commits into
mainfrom
prerender-module-cache-poison-cs-11355

backspace commented Jun 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

habdelra commented Jun 11, 2026 •

edited

Loading

Uh oh!

backspace commented Jun 11, 2026

Uh oh!

habdelra commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

backspace commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Host Test Results

Realm Server Test Results

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

habdelra commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

backspace commented Jun 11, 2026

Uh oh!

habdelra commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

backspace commented Jun 3, 2026 •

edited

Loading

github-actions Bot commented Jun 3, 2026 •

edited

Loading

habdelra commented Jun 11, 2026 •

edited

Loading

habdelra commented Jun 11, 2026 •

edited

Loading