Skip to content

Instrument the realm-server search path with a request→response timeline (CS-11363)#5098

Open
habdelra wants to merge 2 commits into
mainfrom
worktree-cs-search-server-timing
Open

Instrument the realm-server search path with a request→response timeline (CS-11363)#5098
habdelra wants to merge 2 commits into
mainfrom
worktree-cs-search-server-timing

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

@habdelra habdelra commented Jun 3, 2026

What this does

Adds server-side timing instrumentation to the realm-server's federated-search path, so we can answer where a slow _search spends its time — from request receipt to response sent ("hoof to snout") — instead of guessing.

Why

During a from-scratch index of a query-field-heavy realm, prerendered cards stall in the waiting-stability stage for tens of seconds — up to the 90s render timeout — waiting on an in-render _search round-trip (a query-backed linksToMany getter, e.g. a customer's policies field). The card's own queryLoadsInFlight diagnostics show the search in flight for ~88s, yet pg_stat_activity shows zero SQL running longer than a second. So the time is not query execution — the realm-server (a single Node process) is slow to deliver the response, and we had no way to see why.

The gap: the host-side boxel_index.diagnostics column already captures the client's view (queryLoadsInFlight/recentQueryLoads — that's how we know it waited 88s), but the server side of that same request was completely unmetered. The whole pipeline — receipt → SQL → the post-SQL loadLinks relationship/query-field assembly → the per-instance cache round-trips → JSON.stringify of the wire-format → send — emitted nothing, there was no per-request duration, and nothing linked the client's view to the server's.

How it fits together

The search request flows: prerendered host SPA issues _federated-search → realm-server handle-searchsearchRealms → per-realm Realm.searchRealmIndexQueryEngine.searchCardsloadLinks → SQL. The instrumentation threads a small RequestTimings collector down that existing path (via the shared SearchOpts, alongside the job-id/priority already threaded there) and emits one line per request.

The four pieces:

  1. Correlation id (client → server). requestIdHeader() mints x-boxel-request-id on the prerendered host's _federated-search fetch, gated on the prerender context exactly like the existing job-id / consuming-realm headers — so live SPA traffic stamps nothing. It's a join key only; no new client-side measurement (the client wait is already in the diagnostics column). The realm-server reads it back out and keys its timeline on it.

  2. Request→response timeline. handle-search owns a RequestTimings collector (created only when a correlation id is present) and stamps parse, resolveRealms, and stringify directly; searchRealms/searchCards/loadLinks stamp sql, loadLinks, populate (the relationship/query-field assembly), and the per-instance cacheRead/cacheWrite + hit/miss counts onto the same collector. The handler emits one realm:search-timing line: req=… job=… handler=Nms parse=… resolveRealms=… sql=… loadLinks=… populate=… stringify=… | results=… cacheHit=… cacheMiss=…. The correlation id is deliberately kept out of the job-scoped search cache's key material so it can't blow up the cache.

  3. Request duration + id in the request log. The realm:requests middleware now logs the <-- … --> round-trip dur= and the req= id — the outermost request→response bound (includes body read + send), and the join key tying the timeline line, the request line, and the client diagnostics together.

  4. Event-loop health sampler. A periodic realm:health line (only during saturation windows) reporting event-loop lag (mean/p99/max) alongside the in-flight _search count and heap. This is the signal that confirms or refutes the prime suspect for the 88s — the single-threaded process's event loop being starved (by synchronous post-SQL serialization across many concurrent searches) so requests sit unserviced. The gap between the middleware dur= total and the handler's stage sum, read against event-loop lag at the time, localizes queue-before-handler time.

Everything is gated on the presence of a correlation id, so normal/live traffic allocates no collector and emits no extra lines. RequestTimings is threaded through runtime-common but only ever runs in the realm-server (RealmIndexQueryEngine needs the Postgres adapter the browser doesn't have), so the host is unaffected.

Test

A host integration test exercises the whole path end-to-end through the in-realm browser: the SPA mints the correlation id (real requestIdHeader()), issues a real store.search, and the test asserts the id lands in the server's realm:search-timing line (with the sql/loadLinks stage breakdown), via the realm-server mock handing the header to the real searchRealms. A second case asserts that non-prerender traffic stamps no header and emits no line.

Scope

Measurement only — this localizes where the latency is so we can fix it; it does not change the search behavior or fix the latency itself.

Linear: CS-11363

…ine (CS-11363)

During a from-scratch index, prerendered cards block for tens of seconds on
in-render `_search` round-trips while the SQL behind them runs in
milliseconds. We could see the client wait (host-side `boxel_index.diagnostics`)
but had no server-side view of where those seconds went. This adds that view:

- A client-minted `x-boxel-request-id` correlation id, stamped on the
  prerendered host's `_federated-search` fetch (prerender-gated; live SPA
  traffic is untouched). Join key only — no new client measurement.
- A `realm:search-timing` line keyed by that id with the request→response
  stage breakdown: parse, resolveRealms, sql, loadLinks, populate
  (relationship/query-field assembly), per-instance cache read/write +
  hit/miss, stringify, and result count. Emitted by `handle-search` for the
  full handler picture, and standalone by `searchRealms` on the host-test
  path. Gated on a correlation id, so normal traffic allocates nothing.
- Request duration + the correlation id on the `realm:requests` log lines
  (the outermost request→response bound, incl. body read + send).
- A `realm:health` sampler: event-loop lag (mean/p99/max) alongside the
  in-flight `_search` count and heap, logged only during saturation windows
  — the signal that confirms or refutes single-thread saturation directly.

The timing collector is threaded through the shared `SearchOpts` and is a
no-op without a correlation id; `RealmIndexQueryEngine` only ever runs in the
realm-server, so the host stays unaffected.

A host integration test exercises the whole path end-to-end: the in-realm
browser mints the id, issues a real `store.search`, and the test asserts the
id lands in the server's `realm:search-timing` line (and that live,
non-prerender traffic stamps nothing and emits nothing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 498aab2c5c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/runtime-common/realm.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds end-to-end, request-correlated timing instrumentation for the realm-server federated search pipeline so slow _search calls can be attributed to specific server-side stages (parse, realm resolution, SQL, link loading/population, cache read/write, stringify) and correlated with request logs and client diagnostics.

Changes:

  • Introduces a RequestTimings collector and threads it through searchRealmsRealm.searchRealmIndexQueryEngine to stamp stage durations + counters and emit a single realm:search-timing line keyed by a client-minted correlation id.
  • Adds a prerender-only x-boxel-request-id header minted by the host SPA and sanitized/echoed by the realm-server request middleware, including dur= in the realm:requests log for full request→response timing.
  • Adds a realm-server health sampler that periodically records event-loop lag + in-flight search count during saturation windows, plus integration coverage verifying header stamping and timing-line emission behavior.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

Show a summary per file
File Description
packages/runtime-common/search-utils.ts Extends SearchOpts with correlation/timings, adds a testable timing-log emitter, and optionally emits a per-request timing line when it “owns” the collector.
packages/runtime-common/request-timings.ts New request-scoped stage timer + counters with compact log formatting.
packages/runtime-common/realm.ts Threads timings through realm search options into the query engine.
packages/runtime-common/realm-index-query-engine.ts Times sql, loadLinks, populate, and per-instance cache read/write; increments cache hit/miss counters.
packages/runtime-common/prerender-headers.ts Defines/sanitizes x-boxel-request-id to prevent log forging and provide a shared header name.
packages/runtime-common/index.ts Exports RequestTimings from runtime-common’s public surface.
packages/realm-server/search-inflight.ts New module-level in-flight _search counter for health sampling correlation.
packages/realm-server/middleware/index.ts Adds req= correlation + dur= to request logs, echoes sanitized request id, and tracks in-flight search lifecycle (finish/close).
packages/realm-server/main.ts Starts/stops the health sampler with the server lifecycle.
packages/realm-server/health-sampler.ts New event-loop lag sampler that logs realm:health only during saturation windows.
packages/realm-server/handlers/handle-search.ts Creates/threads request timing collector (when correlation id present), stamps parse/resolve/stringify, and emits the full timeline line (incl. cache-hit paths).
packages/host/tests/integration/search-correlation-id-test.gts New integration test asserting prerender-only header stamping and server timing-line correlation.
packages/host/tests/helpers/realm-server-mock/routes.ts Threads sanitized request id into searchRealms in the mock so tests observe real timing emission.
packages/host/app/services/store.ts Adds requestIdHeader() to _federated-search requests.
packages/host/app/lib/prerender-fetch-headers.ts Implements prerender-gated per-request correlation id minting with crypto.randomUUID fallback.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

Preview deployments

Host Test Results

    1 files      1 suites   1h 50m 18s ⏱️
2 933 tests 2 918 ✅ 15 💤 0 ❌
2 952 runs  2 937 ✅ 15 💤 0 ❌

Results for commit 706ed8b.

Realm Server Test Results

    1 files      1 suites   11m 0s ⏱️
1 549 tests 1 548 ✅ 1 💤 0 ❌
1 640 runs  1 639 ✅ 1 💤 0 ❌

Results for commit 706ed8b.

…ed-search timing

- CORS: add the correlation header to the realm-server's `Access-Control-Allow-Headers`
  allowlist. Without it the cross-origin preflight blocks the prerender's
  `_federated-search` once the host stamps the header.
- Rename `x-boxel-request-id` → `x-boxel-logging-correlation-id` (and the
  internal field/vars to `loggingCorrelationId`, the log token to `corr=`):
  the old name read too much like the unrelated `x-boxel-client-request-id`,
  which is realm-event/write correlation, a different concern.
- searchCards in-flight coalescing: exclude the per-request `timings` collector
  from `searchInFlightKey` so it can't perturb the dedup key, and when a follower
  awaits an already-running identical search, record that wait as `coalescedWait`
  on the follower's own collector — otherwise its timing line would show no
  sql/loadLinks and look misleadingly instant under concurrent search load.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team June 3, 2026 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants