docs(adr): ADR-033 — hybrid lexical + semantic job search

LEANDERANTONY · claude · LEANDERANTONY · commit db2ddba19f97 · 2026-05-22T02:21:27.000+05:30
Records the three-tier job-search relevance architecture shipped over
DEVLOG Days 68-75: Tier 1 deterministic synonym/abbreviation query
expansion, Tier 2 pgvector + Reciprocal Rank Fusion hybrid search,
Tier 3 learned ranker explicitly out of scope pre-revenue. Extends
ADR-014's lexical RPC. Indexes ADR-033 in the Core cluster, adds a
job-search line to the current-state note, cross-links architecture.md,
and bumps the README ADR count to 32.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -77,7 +77,7 @@ Each agent follows the same operating shape: deterministic baseline first, LLM-a
 - **74 Python test files** cover parsing, normalization, fitting, tailoring, orchestration, builders, exports, auth, quotas, persistence, the Lemon Squeezy webhook, voice transcription, artifact feedback, prompt-registry byte-identity, error handling, hybrid job search, and the four ATS adapters.
 - **Quality runners** in `tests/quality/` produce evidence for each LLM-driven stage (parser, tailoring, review, resume gen, cover letter, assistant, JD parser, latency baseline). `backend/nightly_eval.py` wraps them into a single regression-checked batch — manual-only at pre-revenue stage by design, see [ADR-026](docs/adr/ADR-026-manual-only-nightly-eval-at-pre-revenue-stage.md).
 - **Every LLM prompt loads from a versioned JSON registry** (`prompts/<name>/v1.json`) — all 11 builders migrated off Python f-string concats, each guarded by a byte-identity test so a template can't silently drift from its original.
-- **31 ADRs** in `docs/adr/` record the architectural decisions, including the Streamlit-first → Next.js + FastAPI transition (ADR-012), DOCX-first export (ADR-015), conversational builder (ADR-016), state-aware assistant (ADR-017), three-layer retry stack (ADR-018), independent step navigation (ADR-019), tier resolution shim (ADR-020), atomic quota with refund (ADR-021), tier-aware model selection (ADR-022), Lemon Squeezy as Merchant of Record for v1 (ADR-023), the observability stack (ADR-024), the EU cookie consent banner (ADR-025), manual-only nightly eval (ADR-026), the tier-gated export entitlement (ADR-027), LLM provider failover + premium reasoning tier (ADR-028), the single-source `ThemeSpec` registry (ADR-029), the résumé-builder agentic architecture (ADR-031), and the six bespoke two-column résumé themes (ADR-032).
+- **32 ADRs** in `docs/adr/` record the architectural decisions, including the Streamlit-first → Next.js + FastAPI transition (ADR-012), DOCX-first export (ADR-015), conversational builder (ADR-016), state-aware assistant (ADR-017), three-layer retry stack (ADR-018), independent step navigation (ADR-019), tier resolution shim (ADR-020), atomic quota with refund (ADR-021), tier-aware model selection (ADR-022), Lemon Squeezy as Merchant of Record for v1 (ADR-023), the observability stack (ADR-024), the EU cookie consent banner (ADR-025), manual-only nightly eval (ADR-026), the tier-gated export entitlement (ADR-027), LLM provider failover + premium reasoning tier (ADR-028), the single-source `ThemeSpec` registry (ADR-029), the résumé-builder agentic architecture (ADR-031), the six bespoke two-column résumé themes (ADR-032), and hybrid lexical + semantic job search (ADR-033).
 - **Architecture details** live in [docs/architecture.md](docs/architecture.md); the day-2 operational runbook in [docs/deployment.md](docs/deployment.md).
 
 ## Deployment
diff --git a/docs/adr/ADR-033-hybrid-job-search.md b/docs/adr/ADR-033-hybrid-job-search.md
@@ -0,0 +1,170 @@
+# ADR-033: Hybrid Lexical + Semantic Job Search
+
+- Status: Accepted
+- Date: 2026-05-22
+
+## Context
+
+[ADR-014](ADR-014-postgres-rpc-for-ranked-search.md) shipped
+`search_cached_jobs_ranked` — lexical full-text search over the
+`cached_jobs` index ([ADR-013](ADR-013-cached-jobs-cache-layer-with-scheduled-refresh.md))
+with `ts_rank`, filters, and sort. It is fast and precise on exact
+keywords, but lexical FTS has a structural ceiling: it can only rank a
+job that literally shares tokens with the query.
+
+A relevance audit of the ~14k-row corpus (DEVLOG Day 68 — a fixed
+query set scored against the results it returned) found the two
+failure modes that ceiling produces:
+
+- **Abbreviation / synonym misses.** "ml engineer" does not match a
+  posting titled "Machine Learning Engineer"; "frontend" misses
+  "React Developer"; "k8s" misses "Kubernetes".
+- **Conceptual misses.** A job that is a strong fit but shares no
+  surface tokens with the query never surfaces at all.
+
+ADR-014's own follow-up anticipated this ("if the cache grows … add a
+tsvector index … consider pg_trgm"). The decision here is the
+relevance upgrade itself.
+
+## Decision
+
+**A three-tier relevance design. Tiers 1 and 2 are shipped; Tier 3 is
+explicitly out of scope at this stage.**
+
+### Tier 1 — deterministic synonym / abbreviation query expansion
+
+`src/job_search_synonyms.py` `expand_query()` rewrites the raw user
+query into a `to_tsquery`-syntax boolean expression before it reaches
+Postgres. "ml engineer" becomes
+`(ml | machine<->learning) & (engineer | developer | dev)`. The
+synonym map is curated from the corpus's own vocabulary (DEVLOG
+Day 68); each query token expands to an OR-group of its known
+equivalents, and the groups are AND-ed.
+
+- **Deterministic, no LLM, no added latency.** Tier 1 is a pure string
+  transform — it cannot fail, cost money, or slow a search down.
+- The RPC parses the result with `to_tsquery`, not the
+  `websearch_to_tsquery` ADR-014 used, because the expanded string is
+  already operator-decorated. Empty / all-stopword input expands to
+  `''`, which the RPC treats as "no lexical filter".
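
The Tier 1 expansion described above can be sketched as follows. This is a minimal illustration, not the contents of `src/job_search_synonyms.py`: the `SYNONYMS` and `STOPWORDS` tables, the tokenizer regex, and the name `expand_query_sketch` are all assumptions made for the example; only the input/output pair for "ml engineer" comes from the ADR.

```python
import re

# Hypothetical tables for illustration only. The real synonym map is
# curated from the corpus vocabulary and is not reproduced in this ADR.
SYNONYMS = {
    "ml": ["ml", "machine learning"],
    "engineer": ["engineer", "developer", "dev"],
    "k8s": ["k8s", "kubernetes"],
}
STOPWORDS = {"a", "an", "the", "in", "of", "for"}


def _to_term(phrase: str) -> str:
    # A multi-word equivalent becomes a tsquery phrase match,
    # e.g. "machine learning" -> "machine<->learning".
    return "<->".join(phrase.split())


def expand_query_sketch(raw: str) -> str:
    """Rewrite a raw query into a to_tsquery-style boolean expression."""
    # Assumed tokenizer: lowercase alphanumeric runs, stopwords dropped.
    tokens = [t for t in re.findall(r"[a-z0-9]+", raw.lower())
              if t not in STOPWORDS]
    groups = []
    for token in tokens:
        terms = [_to_term(alt) for alt in SYNONYMS.get(token, [token])]
        # Each token becomes an OR-group of its known equivalents.
        groups.append(f"({' | '.join(terms)})" if len(terms) > 1 else terms[0])
    # The groups are AND-ed; no surviving tokens yields the empty string.
    return " & ".join(groups)


print(expand_query_sketch("ml engineer"))
```

Because the output is already operator-decorated, it has to be parsed with `to_tsquery`; `websearch_to_tsquery` would read the operators as plain user text.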
+
+### Tier 2 — hybrid lexical + semantic search with RRF
+
+Lexical search, even synonym-expanded, still only matches declared
+equivalents. Tier 2 adds a semantic retriever and fuses the two.
+
+1. **pgvector embedding column.** `cached_jobs` gains
+   `embedding vector(1536)` (`text-embedding-3-small`) with an HNSW
+   cosine index.
+2. **A new `search_cached_jobs_hybrid` RPC** runs two retrievers, each
+   a top-N query over `cached_jobs` so the index drives candidate
+   selection: a lexical pool (`ts_rank` over the GIN index) and a
+   semantic pool (cosine distance `<=>` over the HNSW index).
+3. **Reciprocal Rank Fusion.** The two pools are fused on their
+   *rankings*, not their raw scores: `rrf = 1/(k+lex_rank) +
+   1/(k+sem_rank)`, `k = 60`. `ts_rank` and cosine distance live on
+   incomparable scales; RRF sidesteps normalization entirely — a job
+   ranked highly by *either* signal surfaces.
+4. **Embeddings are produced two ways.** A one-time corpus backfill
+   (`scripts/backfill_job_embeddings.py`) seeds the existing rows;
+   embed-on-write embeds *newly-cached* jobs during the 4-hour refresh
+   (only new rows — see DEVLOG Day 75).
+5. **Gated and graceful.** The hybrid path is behind the
+   `JOB_SEARCH_HYBRID_ENABLED` flag. The query embedding is computed
+   backend-side; on *any* failure (flag off, no OpenAI key, embedding
+   error, RPC error) the store falls back to the Tier 1 lexical RPC.
+   Search never hard-fails because of Tier 2.
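
The fusion step in item 3 can be sketched in Python. In the shipped design this arithmetic runs inside the `search_cached_jobs_hybrid` SQL RPC, so the function below is only an illustration of the formula. The name `rrf_fuse`, the tie-break on job id, and the choice to let a job missing from one pool contribute nothing from that side are assumptions, not details taken from the RPC.

```python
from typing import Dict, List, Sequence


def rrf_fuse(lexical_ids: Sequence[str],
             semantic_ids: Sequence[str],
             k: int = 60) -> List[str]:
    """Fuse two ranked candidate pools with Reciprocal Rank Fusion.

    Each pool is an ordered list of job ids, best match first.
    Only the rank positions are used; raw scores never enter the sum.
    """
    scores: Dict[str, float] = {}
    for pool in (lexical_ids, semantic_ids):
        for rank, job_id in enumerate(pool, start=1):
            # Assumption: a job absent from a pool simply adds nothing
            # for that retriever.
            scores[job_id] = scores.get(job_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda job_id: (-scores[job_id], job_id))


lexical_pool = ["a", "b", "c"]   # ordered by ts_rank
semantic_pool = ["c", "a", "d"]  # ordered by cosine distance
print(rrf_fuse(lexical_pool, semantic_pool))
```

Jobs that appear in both pools ("a" and "c") outrank jobs that appear in only one, and no score normalization is needed at any point.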
+
+The hybrid RPC keeps ADR-014's posture: `SECURITY DEFINER`,
+`SET search_path = public`, `EXECUTE` granted to `service_role` only.
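
The gating and fallback behaviour from item 5 above can be sketched as a thin wrapper. This is not the code in `src/cached_jobs_store.py`: the function name and the three injected callables are hypothetical, and a missing OpenAI key is modelled here as the embedding callable raising.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def search_jobs_sketch(
    query: str,
    *,
    hybrid_enabled: bool,
    embed_query: Callable[[str], Vector],
    hybrid_rpc: Callable[[str, Vector], List[dict]],
    lexical_rpc: Callable[[str], List[dict]],
) -> List[dict]:
    """Try the Tier 2 hybrid path; degrade to Tier 1 on any failure."""
    if hybrid_enabled:
        try:
            vector = embed_query(query)       # may raise: no key, provider error
            return hybrid_rpc(query, vector)  # may raise: RPC error or timeout
        except Exception:
            # Any Tier 2 failure is swallowed so search never hard-fails.
            pass
    # Flag off, or the hybrid path failed: use the lexical RPC.
    return lexical_rpc(query)
```

Passing the retrievers in as callables keeps the sketch free of any real client library; it also makes each failure path easy to exercise with stubs.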
+
+### Tier 3 — learned ranker — out of scope
+
+A learned re-ranker trained on click / save / apply signals is the
+natural Tier 3. It is deliberately not built: pre-revenue, there is no
+interaction data to train on and no labels. RRF is a strong,
+zero-training baseline that a Tier 3 ranker would later refine, not
+replace.
+
+## Alternatives Considered
+
+### 1. Stay pure lexical (synonym expansion only)
+Rejected as the endpoint. Tier 1 alone closes the abbreviation gap but
+not the conceptual one — it can only match equivalences someone
+thought to add to the map. It ships as Tier 1 *inside* this design,
+not instead of it.
+
+### 2. Pure semantic search (replace lexical)
+Rejected. Embedding similarity drifts off precise terms — an exact
+title or company query underperforms, and rare tokens get washed out.
+Lexical precision and semantic recall are complementary; dropping
+either loses real results.
+
+### 3. Weighted score blending instead of RRF
+Rejected. Blending `ts_rank` and cosine distance needs both on a
+common scale; any fixed normalization is a guess that drifts as the
+corpus changes. RRF fuses ranks, which are already comparable, and is
+a widely used default for hybrid retrieval.
+
+### 4. A managed vector database (Pinecone / Weaviate)
+Rejected. pgvector keeps the vectors in the same Postgres that already
+holds `cached_jobs` — one datastore, one backup, one access path, the
+same `service_role` RPC posture. A separate vector service adds infra,
+cost, and a second consistency problem for no gain at this scale.
+
+### 5. IVFFlat index instead of HNSW
+Rejected. IVFFlat needs a training pass over a populated table and
+re-tuning as the corpus grows. HNSW has no training step, so an index
+created on an empty column is usable as rows arrive, which matters
+because the `embedding` column is backfilled *after* the index is created.
+
+## Consequences
+
+### Positive
+
+- Recall improves on both failure modes — abbreviations match their
+  long forms, and conceptually-related jobs surface even with zero
+  shared tokens.
+- Graceful degradation is structural: hybrid is one flag, and every
+  Tier 2 failure path falls back to the proven Tier 1 lexical RPC.
+- The vector layer is pure Postgres — no new infrastructure, no second
+  datastore.
+
+### Negative
+
+- An OpenAI embedding cost: a one-time corpus backfill (the
+  \$0.25–0.50 range estimated in DEVLOG Day 70) plus embed-on-write
+  for new jobs each refresh (cents/day). Small, but the refresh path
+  is no longer strictly \$0 — see `deployment.md`.
+- The hybrid path adds a query-embedding round trip (~200–500 ms), and
+  the HNSW index adds write cost to the refresh upserts. The Day 75
+  incident (re-embedding the whole corpus every refresh churned the
+  index and timed the refresh out) is the cautionary tale of that
+  write cost; the fix bounds embed-on-write to genuinely new rows.
+- Search logic now spans Python (synonym expansion, query embedding,
+  fallback orchestration) and two SQL RPCs. The Tier 1 RPC is retained
+  as both the fallback and the hybrid-disabled path, so the contract
+  surface is two RPCs, not one.
+
+### Neutral
+
+- `JOB_SEARCH_HYBRID_ENABLED` is the operational switch — Tier 2 can
+  be turned off without a deploy if the semantic side ever misbehaves.
+- The hybrid RPC was revised once post-launch: v1 ranked both sides
+  with window functions over the full corpus and hit the statement
+  timeout; v2 uses HNSW / GIN candidate pools (DEVLOG Day 74).
+
+## References
+
+- [ADR-013](ADR-013-cached-jobs-cache-layer-with-scheduled-refresh.md)
+  — the `cached_jobs` cache layer this search reads from.
+- [ADR-014](ADR-014-postgres-rpc-for-ranked-search.md) — the Tier 1
+  lexical `search_cached_jobs_ranked` RPC this extends and falls back
+  to.
+- DEVLOG Days 68 (Tier 1), 70 (Tier 2), 74 (hybrid RPC rewrite), 75
+  (embed-on-write fix).
+- SQL: `docs/sql/supabase-cached-jobs-search.sql` (Tier 1),
+  `supabase-cached-jobs-pgvector.sql` (embedding column + HNSW index),
+  `supabase-cached-jobs-hybrid.sql` (hybrid RPC).
+- Code: `src/job_search_synonyms.py`, `src/cached_jobs_store.py`,
+  `scripts/backfill_job_embeddings.py`.
diff --git a/docs/adr/README.md b/docs/adr/README.md
@@ -19,6 +19,7 @@ The accepted set is grouped into four thematic clusters to make the current prod
 - [ADR-012: Next.js workspace and FastAPI runtime baseline](ADR-012-nextjs-workspace-and-fastapi-runtime-baseline.md)
 - [ADR-013: Cached jobs cache layer with scheduled refresh](ADR-013-cached-jobs-cache-layer-with-scheduled-refresh.md)
 - [ADR-014: Postgres RPC for ranked job search](ADR-014-postgres-rpc-for-ranked-search.md)
+- [ADR-033: Hybrid lexical + semantic job search](ADR-033-hybrid-job-search.md) — extends ADR-014's lexical RPC to a three-tier relevance design: deterministic synonym / abbreviation query expansion (Tier 1) + a pgvector semantic retriever fused with the lexical ranking via Reciprocal Rank Fusion (Tier 2, `search_cached_jobs_hybrid`), gated behind `JOB_SEARCH_HYBRID_ENABLED` with graceful fallback to lexical. Tier 3 (learned ranker) out of scope pre-revenue
 - [ADR-015: DOCX-first artifact export with theme palette](ADR-015-docx-first-artifact-export-with-theme-palette.md)
 - [ADR-029: ThemeSpec single-source + color-theme expansion](ADR-029-themespec-single-source-and-color-theme-expansion.md) — realises ADR-015's typed-`ThemeSpec` follow-up; one registry derives résumé + cover-letter + DOCX palettes + the backend gate set; first new theme `modern_blue` (single-column, ATS-safe). Two-column "presentation" layout reserved for a later gated phase — since realised by ADR-032
 - [ADR-032: Six bespoke two-column résumé themes](ADR-032-two-column-resume-themes.md) — extends ADR-029's `ThemeSpec` registry with six distinct two-column résumé layouts via a `twocol_layout` discriminator, retiring the single placeholder `presentation_twocol` slot; the registry now ships 12 themes (6 single-column ATS-safe + 6 two-column)
@@ -57,7 +58,7 @@ The accepted set is grouped into four thematic clusters to make the current prod
 
 ## Current state note
 
-As of 2026-05-22, the shipped product is a Next.js workspace deployed on Vercel backed by a FastAPI container on a Frankfurt VPS, with a Supabase EU project for Auth + persistence + the cached-jobs index. The agentic workflow runs Tailoring → Review → ResumeGen → CoverLetter on every analysis, with per-agent retry + fallback isolation. Tier enforcement is live across eight counters (Free / Pro / Business) with the Lemon Squeezy payment scaffold env-gated behind a "Coming soon" frontend fallback until the dashboard's final variant IDs land; export format/theme is additionally an entitlement gate (Free = PDF + `professional_neutral`, which is also the product-wide default theme; Pro/Business unlock DOCX + every non-default theme) enforced server-side via the same 429 upgrade path (ADR-027). Theming is one typed `ThemeSpec` registry (ADR-029, ADR-032) deriving the résumé, cover-letter, and DOCX palettes plus the backend supported-theme set; it ships 12 themes — 6 single-column (ATS-safe) and 6 two-column — and its `layout` / `twocol_layout` discriminators drive the renderer, so adding a theme is one registry entry. The observability stack (Sentry `jobagent-backend` + `jobagent-frontend` + a shared PostHog free-tier project tagged with `product: "jobagent"`) is wired with a custom EU cookie consent banner gating PostHog + Sentry Session Replay behind explicit user opt-in; consent persists in a parent-domain cookie so it is honoured across the marketing apex and the `app.` subdomain. `backend/nightly_eval.py` exists and is tested but is **not** on the production cron at pre-revenue stage — re-enabling is a single crontab edit when revenue justifies the recurring LLM spend.
+As of 2026-05-22, the shipped product is a Next.js workspace deployed on Vercel backed by a FastAPI container on a Frankfurt VPS, with a Supabase EU project for Auth + persistence + the cached-jobs index. The agentic workflow runs Tailoring → Review → ResumeGen → CoverLetter on every analysis, with per-agent retry + fallback isolation. Tier enforcement is live across eight counters (Free / Pro / Business) with the Lemon Squeezy payment scaffold env-gated behind a "Coming soon" frontend fallback until the dashboard's final variant IDs land; export format/theme is additionally an entitlement gate (Free = PDF + `professional_neutral`, which is also the product-wide default theme; Pro/Business unlock DOCX + every non-default theme) enforced server-side via the same 429 upgrade path (ADR-027). Theming is one typed `ThemeSpec` registry (ADR-029, ADR-032) deriving the résumé, cover-letter, and DOCX palettes plus the backend supported-theme set; it ships 12 themes — 6 single-column (ATS-safe) and 6 two-column — and its `layout` / `twocol_layout` discriminators drive the renderer, so adding a theme is one registry entry. Job search over the `cached_jobs` index is hybrid (ADR-033) — deterministic synonym expansion plus a pgvector semantic ranking fused with the lexical full-text ranking via Reciprocal Rank Fusion, gated behind `JOB_SEARCH_HYBRID_ENABLED` and degrading to lexical-only on any failure. The observability stack (Sentry `jobagent-backend` + `jobagent-frontend` + a shared PostHog free-tier project tagged with `product: "jobagent"`) is wired with a custom EU cookie consent banner gating PostHog + Sentry Session Replay behind explicit user opt-in; consent persists in a parent-domain cookie so it is honoured across the marketing apex and the `app.` subdomain. `backend/nightly_eval.py` exists and is tested but is **not** on the production cron at pre-revenue stage — re-enabling is a single crontab edit when revenue justifies the recurring LLM spend.
 
 ## Adding a new ADR
 
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -215,7 +215,7 @@ Each `saved_jobs` row stores one shortlisted posting per user and normalized job
 
 Each `resume_builder_sessions` row stores one in-progress conversational resume-builder draft per user with a 7-day TTL refreshed on every save. A `pg_cron` job (`cleanup-expired-resume-builder-sessions`) hard-deletes expired rows every 5 min and RLS hides expired rows from per-user queries; see [ADR-016](adr/ADR-016-conversational-llm-resume-builder.md).
 
-Each `cached_jobs` row holds one upstream posting keyed on `(source, job_id)`. The table has GENERATED STORED columns (`work_mode`, `employment_type_norm`) backing the dropdown filters, `removed_at` tombstones for upstream-closed jobs the user has bookmarked, and an `embedding vector(1536)` column (pgvector, HNSW cosine index) for semantic search. A `pg_cron` + `pg_net` schedule (`cached_jobs_refresh_4h`) POSTs to `/admin/refresh-cache` every 4 hours, six times a day (see `docs/sql/job_cache_cron_setup.sql` for the template — production runs `0 */4 * * *`). Search is two-tier: the lexical `search_cached_jobs_ranked` RPC ([ADR-014](adr/ADR-014-postgres-rpc-for-ranked-search.md)) and the hybrid `search_cached_jobs_hybrid` RPC, which fuses that lexical ranking with a pgvector semantic ranking via Reciprocal Rank Fusion. The hybrid path is gated behind the `JOB_SEARCH_HYBRID_ENABLED` flag and degrades to lexical on any failure.
+Each `cached_jobs` row holds one upstream posting keyed on `(source, job_id)`. The table has GENERATED STORED columns (`work_mode`, `employment_type_norm`) backing the dropdown filters, `removed_at` tombstones for upstream-closed jobs the user has bookmarked, and an `embedding vector(1536)` column (pgvector, HNSW cosine index) for semantic search. A `pg_cron` + `pg_net` schedule (`cached_jobs_refresh_4h`) POSTs to `/admin/refresh-cache` every 4 hours, six times a day (see `docs/sql/job_cache_cron_setup.sql` for the template — production runs `0 */4 * * *`). Search is two-tier: the lexical `search_cached_jobs_ranked` RPC ([ADR-014](adr/ADR-014-postgres-rpc-for-ranked-search.md)) and the hybrid `search_cached_jobs_hybrid` RPC, which fuses that lexical ranking with a pgvector semantic ranking via Reciprocal Rank Fusion. The hybrid path is gated behind the `JOB_SEARCH_HYBRID_ENABLED` flag and degrades to lexical on any failure; see [ADR-033](adr/ADR-033-hybrid-job-search.md).
 
 `aijobagent_run_traces` is an append-only cost-attribution table — one row per successful LLM call (`user_id`, `model`, `task`, `prompt_tokens`, `completion_tokens`, `cost_usd`, `created_at`). Writes are best-effort: a missing table or a write error never propagates to the user-facing path. It is the canonical answer to "what is OpenAI spend doing", separate from the Sentry/PostHog telemetry surface.