docs: catch up DEVLOG + architecture after the launch-readiness audit

LEANDERANTONY · LEANDERANTONY · commit d5518a6e1f72 · 2026-05-30T22:51:29.000+05:30
DEVLOG Day 79 covers the launch-readiness audit + the 10-fix launch PR
(merge a868b24): 73-agent discovery + 3-lens adversarial verification,
2 Criticals (SECURITY-1 BOLA + CRITICAL-2 async quota envelope), 8 Highs
(FLOW-3, FE-SEC-1, BACKEND-2, LLM-1+OBS-1, OBS-2, PERF-1+2, A11Y-1+2,
TEST-1), and the deferrals (H1, PERFDB-1/2/3/4, TEST-2).

DEVLOG Day 80 covers the Medium + Low cleanup PR (merge 507cb3f):
24 Mediums across three domain-coherent phases + 8 Lows one commit each,
plus deferrals (M3, M15, M20, M19 multi-row, M11 follow-ups) and the five
Architectural Recommendations (R1-R5) parked in report.md.

architecture.md splices:
- backend/ section now mentions backend/services/workspace_run_jobs.py
  (owner-scoped, sync quota pre-flight, per-user in-flight cap) and the
  admin-gated /health/sentry-debug
- Observability section now records the Sentry stage-boundary
  breadcrumbs/tags/context/user, the saved-workspaces-retention
  sentry_cron_monitor, and the X-PostHog-Distinct-Id header for
  anonymous attribution
- Persistence Model now records the app_users BEFORE-UPDATE entitlement
  trigger, the atomic save_saved_job RPC, the count_active()
  workspace-quota head-read, and the saved_workspaces 1/1/1 single-slot
  reality
- New "Browser security baseline" subsection documents the
  X-Frame-Options/HSTS/nosniff/Referrer-Policy/CSP-Report-Only header
  set, the safeRedirect/isAllowedRedirect allowlist on backend-supplied
  URLs, and the useAccessibleDialog primitive behind the ⌘K palette +
  assistant FAB

report.md (intentionally untracked per docs/README.md governance) got a
new PARKED (2026-05-30) section that captures the deferred Highs +
Mediums + Lows plus the five Architectural Recommendations verbatim from
the audit report, so the source-of-findings worktree can be cleaned up
without losing the open items.
diff --git a/docs/DEVLOG.md b/docs/DEVLOG.md
@@ -3983,3 +3983,192 @@ pg_cron now records a fast 202 instead of a 524; nothing else changes.
 
 Verification: test suite green; the `/admin/refresh-cache` endpoint
 test was updated to assert 202 + a scheduled background worker.
+
+## Day 79: Launch-readiness audit — 73-agent swarm, 10 fixes shipped
+
+Twitter-launch readiness pass. A 12-domain discovery swarm (security,
+correctness/concurrency, performance/DB, LLM integration, frontend
+security/correctness/perf/a11y, API-contract integrity, observability,
+testing, E2E flows) read the codebase in parallel, then every Critical
+and High finding went through a 3-lens adversarial verification —
+correctness/repro, impact/exploitability, missed-mitigation. A finding
+survived only if at least two of three skeptics could NOT refute it.
+73 agents total. Surviving counts: **2 Critical · 18 High · 24 Medium ·
+8 Low**, with all 20 Critical/High findings surviving (19 at 3/3,
+TEST-2 at 2/3).
+
+The two Criticals were the launch blockers:
+
+- **SECURITY-1** — unauthenticated BOLA on `GET
+  /workspace/analyze-jobs/{job_id}` + `POST .../cancel`. The async-job
+  dict was looked up purely by id; the returned payload included
+  `artifacts.tailored_resume` + `cover_letter` (the PII-densest object
+  in the product). Fix: `owner_user_id` bound at start,
+  `Depends(get_required_auth_tokens)` on both routes, **404** (not
+  403) when not owner so existence isn't confirmed. (`87117f5`)
+- **CRITICAL-2** — the async `/analyze-jobs` path never called the
+  quota gate; a capped Free user got *"The agentic workflow failed
+  unexpectedly"* instead of the structured 429 + upgrade nudge. Fix:
+  run `enforce_llm_budget` synchronously **before** spawning the
+  worker, plus widen `_serialize_job` to round-trip the structured
+  `{code, counter, cap, tier, reset_period}` envelope so the polling
+  hook renders the existing upgrade CTA. (`d19030b`)
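The two Critical fixes above can be sketched in a few lines of Python, assuming an in-memory job store. `JobStore`, `Job`, `JobNotFound`, `QuotaExceeded`, and `check_budget` are hypothetical names for illustration, not the code in `backend/services/workspace_run_jobs.py`; the `"quota_exceeded"` and `"weekly"` values and the cap of 2 are placeholders. Only the envelope keys, the owner binding at start, and the same-404-for-both behavior come from the entry.

```python
import uuid
from dataclasses import dataclass, field


class JobNotFound(Exception):
    """Would map to HTTP 404 for both 'unknown id' and 'not your job'."""


class QuotaExceeded(Exception):
    """Would map to HTTP 429; carries the structured envelope."""

    def __init__(self, envelope: dict):
        super().__init__(envelope["code"])
        self.envelope = envelope


@dataclass
class Job:
    job_id: str
    owner_user_id: str  # bound once, at start time
    status: str = "queued"


@dataclass
class JobStore:
    cap: int = 2  # illustrative cap, not the product's real number
    used: dict = field(default_factory=dict)  # user_id -> runs used this period
    jobs: dict = field(default_factory=dict)  # job_id -> Job

    def check_budget(self, user_id: str, tier: str) -> None:
        count = self.used.get(user_id, 0)
        if count >= self.cap:
            raise QuotaExceeded({
                "code": "quota_exceeded",  # placeholder value
                "counter": count,
                "cap": self.cap,
                "tier": tier,
                "reset_period": "weekly",  # placeholder value
            })

    def start(self, user_id: str, tier: str = "free") -> Job:
        # The quota gate runs synchronously, before any worker would be spawned,
        # so a capped user gets the structured error instead of a generic failure.
        self.check_budget(user_id, tier)
        self.used[user_id] = self.used.get(user_id, 0) + 1
        job = Job(job_id=uuid.uuid4().hex, owner_user_id=user_id)
        self.jobs[job.job_id] = job
        return job

    def get(self, job_id: str, user_id: str) -> Job:
        job = self.jobs.get(job_id)
        # Same error for "unknown" and "not the owner": existence is not confirmed.
        if job is None or job.owner_user_id != user_id:
            raise JobNotFound(job_id)
        return job
```

In this sketch a caller cannot distinguish a job that does not exist from one that belongs to someone else, which is the property the 404-over-403 choice is after.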
+
+Eight Highs landed alongside the Criticals: theme-entitlement scope
+(FLOW-3 — Free résumé export no longer blocked by an unrelated
+cover-letter theme, `17f160f`); browser-security baseline (FE-SEC-1 —
+CSP Report-Only, X-Frame-Options DENY, HSTS, X-Content-Type-Options,
+Referrer-Policy on the Next frontend, `69e36c8`); per-user
+in-flight-runs cap (BACKEND-2 — closes the concurrent-run
+weekly-token bypass and the fairness gap where one user's 5 runs
+locked out the process-wide semaphore, `fc6a8c4`); a cost-attribution
+chokepoint (LLM-1 + OBS-1 — `web_search` routed through
+`OpenAIService` so it meters and cost-traces; `_record_cost_trace`
+falls back to the `meter_user_scope` ContextVar so JD / résumé parser
++ embedding spend finally lands in `aijobagent_run_traces`,
+`8cdbc38`); two missing PostHog funnel events (OBS-2 — `jd_parsed` +
+`resume_built`, plugging the hole between `job_searched` and
+`analysis_started`, `d064241`); two render-storm fixes (PERF-1 +
+PERF-2 — assistant streaming state moved out of `WorkspaceShell`;
+`buildJobReview` memoized; `b-canvas` children `React.memo`-d, so a
+multi-paragraph answer no longer drives hundreds of whole-tree
+reconciliations and JD keystrokes no longer re-parse the multi-KB JD
+on every character, `f870667`); a shared accessible-dialog primitive
+(A11Y-1 + A11Y-2 — `useAccessibleDialog` with focus trap, initial
+focus, Escape, focus restore, applied to the ⌘K palette + assistant
+FAB; palette also gets combobox/listbox semantics, `6b454c6`); and a
+Vitest baseline wired into CI (TEST-1 — 5 coverage cases over
+`humanizeApiError`, `auth-session`, the workspace-quota hook, the
+tier-gate render, and `JDReview` submit wiring; CI frontend job now
+runs lint + build + test, `d376aac`).
+
+Deferred from this PR by deliberate decision (parked in `report.md`):
+H1 (upgrade CTAs all point at `/pricing` which 404s — gated on
+payment going live); PERFDB-1/2/3/4 (four 1000-row time-bombs:
+`cleanup_missing` can hard-delete a bookmarked row, unpaginated
+missing-row enumeration, the workspace-retention sweeper's N+1 +
+1000-row cap, and the `cached_jobs` DDL only living in the live DB —
+acceptable pre-traction, will bite around the thousandth user);
+TEST-2 (`tests/quality/` runners aren't collected by pytest, so a
+prompt edit can silently degrade tailoring/review quality with CI
+green — defer until there's a hermetic no-live-key path).
+
+Verification: 502 backend pytest, Vitest baseline, tsc + eslint clean
+on touched files. Merge: `a868b24`. Live-API smoke after deploy
+confirmed `/workspace/analyze-jobs/<fake>` no auth → 401 (SECURITY-1
+enforced), and the security headers landed on both the app subdomain
+and the marketing site.
+
+## Day 80: Medium + Low cleanup — 24 + 8 from the same audit
+
+Cleanup PR for everything the launch PR scoped out. Three phases,
+thirteen commits on the feature branch, merge `507cb3f`. Verification:
+**980 backend pytest** (up from 502 — this PR adds substantial new
+test coverage), 33 Vitest, clean production build.
+
+Phase 1 — Tier-1 Mediums:
+
+- **M1** — users could PATCH `app_users.plan_tier` / `account_status`
+  through their own JWT because the RLS UPDATE policy was
+  `using/with check (auth.uid() = id)` with no column restriction,
+  and the legacy daily-quota path read `app_users.plan_tier`. Now a
+  BEFORE-UPDATE trigger rejects non-`service_role` writes to those
+  columns, and `get_daily_quota_for_plan` reads from
+  `resolve_user_tier` (which sources `aijobagent_subscriptions`).
+  (`36c2aa8`)
+- **M5–M10, M14, M16** — iframe `sandbox=""` on the preview surfaces;
+  session-replay PII masked via privacy-by-default;
+  `safeRedirect`/`isAllowedRedirect` allowlist on every backend-
+  supplied URL the client navigates to; JD auto-parse 429 notices
+  now surface an inline upgrade CTA; clear-then-repaste resets
+  `lastParsedTextRef` so the LLM-parsed panels return on retry;
+  `handleSignOut` resets workspace content slices so isolation holds
+  even without the hard-nav backstop; account popover restores focus
+  and ditches the wrong-widget `role="menu"` for a labelled
+  disclosure; debounced JD auto-parse threads `AbortSignal` through
+  `request()` so a superseded LLM parse actually cancels.
+  (`d42332b`)
+
+Phase 2 — Tier-2 Mediums (three sub-commits, domain-coherent):
+
+- **Backend correctness + coverage** (`c18109b`): atomic
+  `save_saved_job` RPC closes the count-then-upsert TOCTOU on the
+  persistent saved-jobs cap (M2); `/workspace/quota`'s
+  `_persistent_count()` uses a new `count_active()` head-read
+  instead of deserializing the fat saved-workspace blob on every
+  mount-and-after-every-run poll (M4); `POST /billing/portal` got
+  tests across all six outcomes (M18); `saved_workspaces` per-tier
+  cap pinned to **1/1/1** (M19 — the schema is one-row-per-user, so
+  the cap+1 case was structurally unenforceable; multi-row history
+  flagged as future enhancement).
+- **Observability + anon attribution** (`11eb8c5`): backend events
+  for unauthenticated callers now use the browser's PostHog distinct
+  id via a new `X-PostHog-Distinct-Id` request header (M21 — closes
+  the funnel hole where every anon visitor mapped to one literal
+  `"anonymous"` person and anonymous→signup conversion couldn't be
+  computed); the retention sweeper got its `sentry_cron_monitor`
+  (`saved-workspaces-retention`) so a stuck cron pages instead of
+  silently leaving Free data past its 7-day retention promise (M22);
+  Sentry breadcrumbs / tags / context / user are now set on each
+  analysis stage and on the export route, defeating the
+  AI-Agents-Monitoring blind spot ADR-024 was adopted for (M23).
+- **Frontend perf + UX** (`1a3bc69`): job-grid memoized via
+  `React.memo(JobCard)` + stabilized per-card callbacks (M11 — full
+  `JobSearch` memo benefit + virtualization deferred); session
+  replay is route-gated to marketing pages (M12 — `posthog-js` has
+  no client `session_recording` sample rate, so route gating is the
+  available knob); `--fg-4` lightened to a contrast-passing token
+  for its four text-uses (M13); dead `BackendHealth` type +
+  `getBackendHealth` deleted (M17 — no caller; better to remove than
+  fix the drift); JD paste no longer collapses the input textarea
+  ~1.5s after a paste (M24).
+
+Phase 3 — Lows, one commit per finding: `/health/sentry-debug` gated
+behind the admin bearer secret so anyone curling it stops burning
+Sentry quota (L1, `2987364`); `fetch_github_readme` sets
+`allow_redirects=False` so the SSRF-adjacent surface disappears (L2,
+`1cb5a63`); a regression test pins the `web_search` 30s timeout the
+launch PR already shipped via LLM-1 (L4, `b7f2884`); completed
+analysis jobs drop `job.result` on the first terminal get +
+`JOB_TTL_SECONDS` tightened 1800→600 (L3, `cf6f8f4`); the "Parsing
+JD…" indicator is gated on the current `AbortController` so a
+superseded request's `finally` doesn't hide the busy hint while a
+newer parse is still in flight (L5, `a4239c8`); the VoiceInputButton
+reduced-motion override is driven off a class instead of a brittle
+`[style*="animation"]` substring selector (L6, `7035d2f`); the dead
+non-streaming `askWorkspaceAssistant` client fn is gone but the
+backend `/workspace/assistant/answer` route stays as a tested
+lockstep fallback — the report's "dead" framing was inaccurate; the
+route shares the metered `answer_workspace_question` path (L7,
+`064532c`); auth-cookie tests now assert `Secure`, `SameSite`, and
+clear-scope so a config refactor dropping any of those would fail
+(L8, `3ec4b6a`).
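The L3 change (drop `job.result` on the first terminal get, plus the shorter TTL) might look roughly like the sketch below. `JobRegistry`, `JobRecord`, the status names, and the prune-on-access behavior are assumptions made for illustration; only the `JOB_TTL_SECONDS` value of 600 and the drop-after-first-terminal-read idea come from the entry.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional

JOB_TTL_SECONDS = 600  # tightened from 1800 per the entry above
TERMINAL = {"succeeded", "failed", "cancelled"}  # assumed status names


@dataclass
class JobRecord:
    status: str = "running"
    result: Optional[Any] = None
    finished_at: Optional[float] = None


class JobRegistry:
    def __init__(self, clock=time.monotonic):
        self._jobs = {}
        self._clock = clock  # injectable so tests can fake time

    def add(self, job_id: str) -> None:
        self._jobs[job_id] = JobRecord()

    def finish(self, job_id: str, result: Any) -> None:
        job = self._jobs[job_id]
        job.status = "succeeded"
        job.result = result
        job.finished_at = self._clock()

    def get(self, job_id: str):
        """Return (status, result), or None once the job is unknown or expired."""
        self._prune()
        job = self._jobs.get(job_id)
        if job is None:
            return None
        result = job.result
        if job.status in TERMINAL:
            # Hand the large payload out once, then stop holding it in memory.
            job.result = None
        return job.status, result

    def _prune(self) -> None:
        # Lazy prune on access (an assumption); no background timer here.
        now = self._clock()
        expired = [
            jid for jid, job in self._jobs.items()
            if job.finished_at is not None and now - job.finished_at > JOB_TTL_SECONDS
        ]
        for jid in expired:
            del self._jobs[jid]
```

A second poll after completion still sees the terminal status but no payload, so resident memory is bounded by the first read rather than by the TTL alone.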
+
+Deferred from the cleanup (parked in `report.md` alongside the launch
+PR's deferrals): **M3** (process-global run-concurrency cap with no
+per-user fairness — effectively addressed by BACKEND-2's per-user
+cap; the architectural piece is Rec #1/#3 territory); **M15, M20**
+(export + streaming-assistant 429 upgrade CTAs — blocked on the same
+`/pricing` destination that gates H1); **M19 multi-row workspaces**
+(single-slot is shipped reality; multi-row needs a schema migration
+plus un-deferring the structural-enforcement test); **M11
+follow-ups** (wrap `WorkspaceShell`'s `JobSearch` callbacks in
+`useCallback` to fully activate the memo boundary, and add grid
+virtualization — needs a windowing dep); **L3 follow-up** (an
+optional periodic prune timer; the terminal-get drop + lowered TTL
+already bound resident memory).
+
+Plus the five **Architectural Recommendations** from the audit report
+(R1 async-as-transparent-transport; R2 `OpenAIService` as the only
+door; R3 per-user authZ + HTTP security to enforced edges; R4
+paginated maintenance scans + tracked `cached_jobs` migration; R5
+shared accessible-overlay primitive + workspace shell split + CI test
+tier). Some are partially complete after the launch + cleanup PRs;
+the "architecture" half of each is parked. All five are documented in
+`report.md`.
+
+Live smoke post-deploy: `GET /health/sentry-debug` no auth → **401**
+(was 500 before — proves both the L1 fix and the deploy on `507cb3f`),
+`GET /workspace/analyze-jobs/<fake>` no auth → 401 (SECURITY-1 still
+enforced), security headers still healthy on both subdomains, 31/31
+hermetic new cleanup tests pass locally.
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -72,6 +72,8 @@ Owns the FastAPI API surface:
 - `backend/routers/billing.py` owns the HMAC-verified `POST /webhooks/lemonsqueezy` subscription-event endpoint + the customer-portal redirect; the signature-verification + event-routing logic lives in `backend/webhooks/lemonsqueezy.py`
 - `backend/prompt_registry.py` loads every LLM prompt from `prompts/<name>/<version>.json` — all 11 builders migrated off Python f-string concats; see [ADR-018](adr/ADR-018-three-layer-llm-retry-and-per-agent-fallback-isolation.md) family + the prompt-registry DEVLOG entries
 - `backend/services/job_cache_service.py` runs the per-source refresh + smart-cleanup worker invoked by the admin endpoint
+- `backend/services/workspace_run_jobs.py` owns the async `/analyze-jobs` job system. Each `WorkspaceRunJob` is bound to its `owner_user_id` at start time and the status/cancel routes check it (returning **404** — same code for "unknown" and "not yours" — so existence isn't confirmed); the quota gate runs **synchronously before** the worker is spawned, with the structured `{code, counter, cap, tier, reset_period}` envelope round-tripped through `_serialize_job` so the polling hook renders the same 429 upgrade CTA the sync path does; a per-user in-flight cap (1 run/user) sits in front of the process-global `BoundedSemaphore(5)` so one user's burst can't 503 every other account. The launch-readiness pass that introduced these guarantees is DEVLOG Day 79
+- `backend/routers/health.py` also hosts `/health/sentry-debug` — now gated behind the admin bearer secret so an unauthenticated curl gets a 401 instead of a `ZeroDivisionError` that would burn Sentry quota (DEVLOG Day 80)
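As an illustration of the per-user cap sitting in front of the process-global semaphore, here is a minimal Python sketch. `RunGate`, `UserBusy`, and `ServerBusy` are hypothetical names, and the non-blocking acquire is an assumption about how a full pool is reported; the real logic lives in `backend/services/workspace_run_jobs.py` and may differ.

```python
import threading
from contextlib import contextmanager


class UserBusy(Exception):
    """The caller already has the maximum number of runs in flight."""


class ServerBusy(Exception):
    """All process-wide run slots are taken (the 503 case)."""


class RunGate:
    def __init__(self, global_slots: int = 5, per_user_cap: int = 1):
        self._global = threading.BoundedSemaphore(global_slots)
        self._per_user_cap = per_user_cap
        self._in_flight = {}  # user_id -> number of active runs
        self._lock = threading.Lock()

    @contextmanager
    def slot(self, user_id: str):
        # Per-user check first, so one user's burst never reaches the shared pool.
        with self._lock:
            if self._in_flight.get(user_id, 0) >= self._per_user_cap:
                raise UserBusy(user_id)
            self._in_flight[user_id] = self._in_flight.get(user_id, 0) + 1
        acquired = self._global.acquire(blocking=False)
        try:
            if not acquired:
                raise ServerBusy()
            yield
        finally:
            if acquired:
                self._global.release()
            # Always give the per-user slot back, even if the global pool was full.
            with self._lock:
                remaining = self._in_flight[user_id] - 1
                if remaining:
                    self._in_flight[user_id] = remaining
                else:
                    del self._in_flight[user_id]
```

A worker would wrap the whole run in `with gate.slot(user_id):`, so the slot is released on success, failure, or cancellation alike.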
 
 ### `src/services/`
 
@@ -221,6 +223,8 @@ Each `cached_jobs` row holds one upstream posting keyed on `(source, job_id)`. T
 
 `aijobagent_feedback` holds one row per artifact thumbs-up/down (`user_id`, `workspace_id`, `artifact_kind`, `rating`, `comment`, `created_at`), RLS-scoped to the owning user; admin reads go through the service role.
 
+A small set of structural reinforcements landed during the launch-readiness cleanup (DEVLOG Day 80) that are worth flagging here because they're load-bearing on the entitlement and read-fast paths: (1) a BEFORE-UPDATE trigger on `app_users` rejects non-`service_role` writes to `plan_tier` / `account_status`, so the unrestricted RLS UPDATE policy can no longer be abused to PATCH one's own tier; the legacy daily-quota path now sources tier from `resolve_user_tier` (which reads `aijobagent_subscriptions`) instead of `app_users.plan_tier`; (2) `save_saved_job` is now an atomic SECURITY DEFINER RPC that count-and-inserts in one transaction (advisory lock), closing the TOCTOU window where two concurrent saves at count=cap−1 could both pass and exceed the persistent cap; (3) `/workspace/quota`'s `_persistent_count()` no longer reads the fat `saved_workspaces` blob — a `count_active(user_id)` head-read returns 0/1 without deserializing `workflow_snapshot_json` / `cover_letter_payload_json` / `tailored_resume_payload_json`. `saved_workspaces` per-tier caps are pinned to **1/1/1** because the schema is one-row-per-user (multi-row history is flagged as a future enhancement requiring a schema migration).
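The count-and-insert-in-one-transaction idea behind `save_saved_job` can be demonstrated with a small Python sketch. This is an analog under stated assumptions, not the actual RPC: it uses SQLite's `BEGIN IMMEDIATE` write lock in place of the Postgres advisory lock, and the `saved_jobs` table layout and the `init_schema` helper are hypothetical.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical minimal table; the real schema is not shown in this document.
    conn.execute(
        "CREATE TABLE saved_jobs ("
        " user_id TEXT NOT NULL,"
        " job_id TEXT NOT NULL,"
        " PRIMARY KEY (user_id, job_id))"
    )


def save_saved_job(conn: sqlite3.Connection, user_id: str, job_id: str, cap: int) -> bool:
    """Save a job unless it would push the user past `cap`. Returns True if saved."""
    # Take the write lock *before* counting, so no other writer can slip in
    # between the count and the insert (the TOCTOU window).
    conn.execute("BEGIN IMMEDIATE")
    try:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM saved_jobs WHERE user_id = ?", (user_id,)
        ).fetchone()
        already_saved = conn.execute(
            "SELECT 1 FROM saved_jobs WHERE user_id = ? AND job_id = ?",
            (user_id, job_id),
        ).fetchone() is not None
        if not already_saved and count >= cap:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "INSERT OR REPLACE INTO saved_jobs (user_id, job_id) VALUES (?, ?)",
            (user_id, job_id),
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The connection should be opened with `isolation_level=None` so the explicit `BEGIN IMMEDIATE` is not wrapped in an implicit transaction by the `sqlite3` module.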
+
 ## Observability And Telemetry Layer
 
 Wired Day 46. The compliance posture is enforced at the SDK-init level, not as legalese on a privacy page — see [ADR-024](adr/ADR-024-observability-stack-sentry-and-posthog.md) and [ADR-025](adr/ADR-025-eu-cookie-consent-banner-and-gdpr-analytics-gating.md).
@@ -232,6 +236,8 @@ Two vendors, one bootstrap path:
 
 Both clients are no-ops when their DSN / key is empty, so dev, CI, and the test suite run without observability wiring or network calls.
 
+The launch-readiness cleanup (DEVLOG Day 80) added three reinforcements to this surface: (1) Sentry breadcrumbs / tags / context / user are now set on each pipeline stage in `src/agents/orchestrator.py` (via the stage-boundary callback, not the orchestrator internals) and on the export route, so a mid-pipeline 5xx is localizable to the failing agent — defeating the AI-Agents-Monitoring blind spot ADR-024 was adopted for; (2) the `saved-workspaces-retention` sweeper got its `sentry_cron_monitor` wrapper so a stuck retention cron now pages instead of silently leaving Free data past its 7-day retention promise; (3) backend events emitted by unauthenticated callers now carry the browser's PostHog distinct id via a new `X-PostHog-Distinct-Id` request header — the previous `"anonymous"` constant collapsed every anon visitor onto one PostHog person and made anonymous→signup conversion uncomputable.
+
 ### Consent gating
 
 The single source of truth is `localStorage["jobagent-cookie-consent"]`, set by the custom in-house cookie banner (`frontend/src/components/cookie-consent.tsx`), three states: `pending` / `accepted` / `declined`. The split:
@@ -245,6 +251,20 @@ A `jobagent-cookie-consent-change` custom event re-evaluates the gated integrati
 
 A Sentry Uptime monitor pings `https://api.job-application-copilot.xyz/health` every 5 minutes from the EU region. Configured in the Sentry dashboard rather than in code — a fresh-project rebuild must recreate it manually.
 
+## Browser security baseline
+
+The Next.js app sends a fixed set of response headers on every route, configured via `headers()` in `frontend/next.config.ts`. The defense-in-depth posture is the same on the marketing site and the workspace subdomain:
+
+- **`X-Frame-Options: DENY`** + **`Content-Security-Policy: frame-ancestors 'none'`** — clickjacking defense. The workspace can't be framed and overlaid to trick a signed-in user into destructive actions; SameSite=Lax cookies would otherwise ride along on top-level navigation.
+- **`Strict-Transport-Security: max-age=63072000; includeSubDomains; preload`** — HTTPS for two years across all subdomains, preload-eligible.
+- **`X-Content-Type-Options: nosniff`** — disables MIME-type sniffing on responses (resource loaders honor the declared `Content-Type`).
+- **`Referrer-Policy: strict-origin-when-cross-origin`** — strips path + query from the Referer on cross-origin navigation while keeping it intact within the site.
+- **`Content-Security-Policy`** as Report-Only for the first weeks of public traffic — same-origin defaults plus the actual allowlist (PostHog `eu.i.posthog.com`, Sentry `*.sentry.io`, Lemon Squeezy, Supabase `*.supabase.co`). Tuning to enforce-mode tracks violation reports in Sentry.
+
+The launch-readiness pass that introduced this baseline is DEVLOG Day 79 (FE-SEC-1). On the client, every backend-supplied redirect URL the browser navigates to passes through an explicit allowlist (`safeRedirect` / `isAllowedRedirect` in `frontend/src/lib/redirectAllowlist.ts`) so the OAuth handoff + workspace-shell redirects can't be steered to an attacker-controlled origin (DEVLOG Day 80, M7).
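The shape of a redirect allowlist check is sketched below in Python for illustration only; the real helpers are TypeScript in `frontend/src/lib/redirectAllowlist.ts` and their exact rules are not shown here. `ALLOWED_HOSTS` and both function bodies are assumptions, written to show the kinds of inputs such a check usually has to reject.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; the real set of origins is not listed in this document.
ALLOWED_HOSTS = frozenset({"app.example.com", "billing.example.com"})


def is_allowed_redirect(url: str) -> bool:
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URLs are never followed
    if not parts.scheme and not parts.netloc:
        # Relative target: must be a plain same-origin path. Backslashes are
        # rejected because some browsers treat "/\host" like "//host".
        return url.startswith("/") and not url.startswith("//") and "\\" not in url
    # Absolute target: https only, and the parsed hostname must be allowlisted.
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def safe_redirect(url: str, fallback: str = "/") -> str:
    """Return `url` if it passes the allowlist, otherwise a safe fallback path."""
    return url if is_allowed_redirect(url) else fallback
```

Comparing the parsed hostname rather than matching on a string prefix is the key design choice: it keeps inputs like `https://app.example.com@evil.example/` from passing.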
+
+The accessible-overlay primitive (`frontend/src/lib/useAccessibleDialog.ts`) is the shared focus-trap + initial-focus + Escape + focus-restore contract behind every modal surface in the workspace shell — the ⌘K command palette and the assistant FAB use it directly; the palette also gets combobox/listbox semantics (`role="combobox"` + `aria-expanded` + `aria-controls` + `aria-activedescendant`, list `role="listbox"`, items `role="option"` with `aria-selected`). DEVLOG Day 79 (A11Y-1/A11Y-2).
+
 ## Testing Model
 
 The repo includes focused tests for: