
Commit db2ddba

LEANDERANTONY and claude committed
docs(adr): ADR-033 — hybrid lexical + semantic job search
Records the three-tier job-search relevance architecture shipped over DEVLOG Days 68-75: Tier 1 deterministic synonym/abbreviation query expansion, Tier 2 pgvector + Reciprocal Rank Fusion hybrid search, Tier 3 learned ranker explicitly out of scope pre-revenue. Extends ADR-014's lexical RPC. Indexes ADR-033 in the Core cluster, adds a job-search line to the current-state note, cross-links architecture.md, and bumps the README ADR count to 32.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0cc4856 commit db2ddba

4 files changed

Lines changed: 174 additions & 3 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ Each agent follows the same operating shape: deterministic baseline first, LLM-a
7777
- **74 Python test files** cover parsing, normalization, fitting, tailoring, orchestration, builders, exports, auth, quotas, persistence, the Lemon Squeezy webhook, voice transcription, artifact feedback, prompt-registry byte-identity, error handling, hybrid job search, and the four ATS adapters.
7878
- **Quality runners** in `tests/quality/` produce evidence for each LLM-driven stage (parser, tailoring, review, resume gen, cover letter, assistant, JD parser, latency baseline). `backend/nightly_eval.py` wraps them into a single regression-checked batch — manual-only at pre-revenue stage by design, see [ADR-026](docs/adr/ADR-026-manual-only-nightly-eval-at-pre-revenue-stage.md).
7979
- **Every LLM prompt loads from a versioned JSON registry** (`prompts/<name>/v1.json`) — all 11 builders migrated off Python f-string concats, each guarded by a byte-identity test so a template can't silently drift from its original.
80-
- **31 ADRs** in `docs/adr/` record the architectural decisions, including the Streamlit-first → Next.js + FastAPI transition (ADR-012), DOCX-first export (ADR-015), conversational builder (ADR-016), state-aware assistant (ADR-017), three-layer retry stack (ADR-018), independent step navigation (ADR-019), tier resolution shim (ADR-020), atomic quota with refund (ADR-021), tier-aware model selection (ADR-022), Lemon Squeezy as Merchant of Record for v1 (ADR-023), the observability stack (ADR-024), the EU cookie consent banner (ADR-025), manual-only nightly eval (ADR-026), the tier-gated export entitlement (ADR-027), LLM provider failover + premium reasoning tier (ADR-028), the single-source `ThemeSpec` registry (ADR-029), the résumé-builder agentic architecture (ADR-031), and the six bespoke two-column résumé themes (ADR-032).
80+
- **32 ADRs** in `docs/adr/` record the architectural decisions, including the Streamlit-first → Next.js + FastAPI transition (ADR-012), DOCX-first export (ADR-015), conversational builder (ADR-016), state-aware assistant (ADR-017), three-layer retry stack (ADR-018), independent step navigation (ADR-019), tier resolution shim (ADR-020), atomic quota with refund (ADR-021), tier-aware model selection (ADR-022), Lemon Squeezy as Merchant of Record for v1 (ADR-023), the observability stack (ADR-024), the EU cookie consent banner (ADR-025), manual-only nightly eval (ADR-026), the tier-gated export entitlement (ADR-027), LLM provider failover + premium reasoning tier (ADR-028), the single-source `ThemeSpec` registry (ADR-029), the résumé-builder agentic architecture (ADR-031), the six bespoke two-column résumé themes (ADR-032), and hybrid lexical + semantic job search (ADR-033).
8181
- **Architecture details** live in [docs/architecture.md](docs/architecture.md); the day-2 operational runbook in [docs/deployment.md](docs/deployment.md).
8282

8383
## Deployment
Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
1+
# ADR-033: Hybrid Lexical + Semantic Job Search
2+
3+
- Status: Accepted
4+
- Date: 2026-05-22
5+
6+
## Context
7+
8+
[ADR-014](ADR-014-postgres-rpc-for-ranked-search.md) shipped
9+
`search_cached_jobs_ranked` — lexical full-text search over the
10+
`cached_jobs` index ([ADR-013](ADR-013-cached-jobs-cache-layer-with-scheduled-refresh.md))
11+
with `ts_rank`, filters, and sort. It is fast and precise on exact
12+
keywords, but lexical FTS has a structural ceiling: it can only rank a
13+
job that literally shares tokens with the query.
14+
15+
A relevance audit of the ~14k-row corpus (DEVLOG Day 68 — a fixed
16+
query set scored against the results it returned) found the two
17+
failure modes that ceiling produces:
18+
19+
- **Abbreviation / synonym misses.** "ml engineer" does not match a
20+
posting titled "Machine Learning Engineer"; "frontend" misses
21+
"React Developer"; "k8s" misses "Kubernetes".
22+
- **Conceptual misses.** A job that is a strong fit but shares no
23+
surface tokens with the query never surfaces at all.
24+
25+
ADR-014's own follow-up anticipated this ("if the cache grows … add a
26+
tsvector index … consider pg_trgm"). The decision here is the
27+
relevance upgrade itself.
28+
29+
## Decision
30+
31+
**A three-tier relevance design. Tiers 1 and 2 are shipped; Tier 3 is
32+
explicitly out of scope at this stage.**
33+
34+
### Tier 1 — deterministic synonym / abbreviation query expansion
35+
36+
`src/job_search_synonyms.py` `expand_query()` rewrites the raw user
37+
query into a `to_tsquery`-syntax boolean expression before it reaches
38+
Postgres. "ml engineer" becomes
39+
`(ml | machine<->learning) & (engineer | developer | dev)`. The
40+
synonym map is curated from the corpus's own vocabulary (DEVLOG
41+
Day 68); each query token expands to an OR-group of its known
42+
equivalents, and the groups are AND-ed.
43+
44+
- **Deterministic, no LLM, no added latency.** Tier 1 is a pure string
45+
transform — it cannot fail, cost money, or slow a search down.
46+
- The RPC parses the result with `to_tsquery`, not the
47+
`websearch_to_tsquery` ADR-014 used, because the expanded string is
48+
already operator-decorated. Empty / all-stopword input expands to
49+
`''`, which the RPC treats as "no lexical filter".
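
The expansion described above can be sketched roughly as follows. This is a minimal illustration and not the contents of `src/job_search_synonyms.py`: the `SYNONYMS` map, the `STOPWORDS` set, the tokenizing regex, and the `_to_term` helper are all assumptions made for the example. Only the `expand_query` name and the output shape come from the ADR text.

```python
import re

# Hypothetical synonym map for illustration; the ADR says the real map is
# curated from the corpus's own vocabulary.
SYNONYMS = {
    "ml": ["ml", "machine learning"],
    "engineer": ["engineer", "developer", "dev"],
    "k8s": ["k8s", "kubernetes"],
}

# Hypothetical stopword list, used only to show the empty-result case.
STOPWORDS = {"a", "an", "the", "in", "of", "for"}


def _to_term(phrase: str) -> str:
    """Join a multi-word phrase with the tsquery phrase operator <->."""
    return "<->".join(phrase.split())


def expand_query(raw: str) -> str:
    """Expand each token to an OR-group of equivalents, then AND the groups."""
    # Simplified tokenizer (assumption): lowercase alphanumeric runs only.
    tokens = [t for t in re.findall(r"[a-z0-9]+", raw.lower()) if t not in STOPWORDS]
    groups = []
    for token in tokens:
        alternatives = SYNONYMS.get(token, [token])
        terms = [_to_term(alt) for alt in alternatives]
        # A single term needs no parentheses; several become an OR-group.
        groups.append(terms[0] if len(terms) == 1 else "(" + " | ".join(terms) + ")")
    return " & ".join(groups)


print(expand_query("ml engineer"))
# (ml | machine<->learning) & (engineer | developer | dev)
```

With these assumed inputs, an all-stopword query such as `"the of"` yields `''`, which is the case the RPC treats as "no lexical filter".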
50+
51+
### Tier 2 — hybrid lexical + semantic search with RRF
52+
53+
Lexical search, even synonym-expanded, still only matches declared
54+
equivalents. Tier 2 adds a semantic retriever and fuses the two.
55+
56+
1. **pgvector embedding column.** `cached_jobs` gains
57+
`embedding vector(1536)` (`text-embedding-3-small`) with an HNSW
58+
cosine index.
59+
2. **A new `search_cached_jobs_hybrid` RPC** runs two retrievers, each
60+
a top-N query over `cached_jobs` so the index drives candidate
61+
selection: a lexical pool (`ts_rank` over the GIN index) and a
62+
semantic pool (cosine distance `<=>` over the HNSW index).
63+
3. **Reciprocal Rank Fusion.** The two pools are fused on their
64+
*rankings*, not their raw scores: `rrf = 1/(k+lex_rank) +
65+
1/(k+sem_rank)`, `k = 60`. `ts_rank` and cosine distance live on
66+
incomparable scales; RRF sidesteps normalization entirely — a job
67+
ranked highly by *either* signal surfaces.
68+
4. **Embeddings are produced two ways.** A one-time corpus backfill
69+
(`scripts/backfill_job_embeddings.py`) seeds the existing rows;
70+
embed-on-write embeds *newly cached* jobs during the 4-hour refresh
71+
(only new rows — see DEVLOG Day 75).
72+
5. **Gated and graceful.** The hybrid path is behind the
73+
`JOB_SEARCH_HYBRID_ENABLED` flag. The query embedding is computed
74+
backend-side; on *any* failure (flag off, no OpenAI key, embedding
75+
error, RPC error) the store falls back to the Tier 1 lexical RPC.
76+
Search never hard-fails because of Tier 2.
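
The fusion formula in step 3 can be illustrated in a few lines of Python. This is a sketch of the arithmetic only; per the ADR the real fusion runs inside the `search_cached_jobs_hybrid` RPC. The `rrf_fuse` name and the job IDs are hypothetical. The handling of a job that appears in only one pool (it simply gets no contribution from the other side) is an assumption, since the ADR does not spell that case out.

```python
from collections import defaultdict


def rrf_fuse(lexical_ids, semantic_ids, k=60):
    """Fuse two ranked ID lists with Reciprocal Rank Fusion.

    Each list is ordered best-first. A job's score is the sum of
    1 / (k + rank) over the pools it appears in, with ranks starting at 1.
    Assumption: a job absent from a pool contributes nothing for that pool.
    """
    scores = defaultdict(float)
    for pool in (lexical_ids, semantic_ids):
        for rank, job_id in enumerate(pool, start=1):
            scores[job_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda job_id: (-scores[job_id], job_id))


lexical = ["a", "b", "c"]   # hypothetical ts_rank ordering
semantic = ["b", "d"]       # hypothetical cosine-distance ordering
print(rrf_fuse(lexical, semantic))
# ['b', 'a', 'd', 'c']
```

Job `b` wins because it is the only one present in both pools. After it, the ordering depends on rank position alone, never on the raw `ts_rank` or cosine values.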
77+
78+
The hybrid RPC keeps ADR-014's posture: `SECURITY DEFINER`,
79+
`SET search_path = public`, `EXECUTE` granted to `service_role` only.
80+
81+
### Tier 3 — learned ranker — out of scope
82+
83+
A learned re-ranker trained on click / save / apply signals is the
84+
natural Tier 3. It is deliberately not built: pre-revenue, there is no
85+
interaction data to train on and no labels. RRF is a strong,
86+
zero-training baseline that a Tier 3 ranker would later refine, not
87+
replace.
88+
89+
## Alternatives Considered
90+
91+
### 1. Stay pure lexical (synonym expansion only)
92+
Rejected as the endpoint. Tier 1 alone closes the abbreviation gap but
93+
not the conceptual one — it can only match equivalences someone
94+
thought to add to the map. It ships as Tier 1 *inside* this design,
95+
not instead of it.
96+
97+
### 2. Pure semantic search (replace lexical)
98+
Rejected. Embedding similarity drifts off precise terms — an exact
99+
title or company query underperforms, and rare tokens get washed out.
100+
Lexical precision and semantic recall are complementary; dropping
101+
either loses real results.
102+
103+
### 3. Weighted score blending instead of RRF
104+
Rejected. Blending `ts_rank` and cosine distance needs both on a
105+
common scale; any fixed normalization is a guess that drifts as the
106+
corpus changes. RRF fuses ranks, which are already comparable, and is
107+
the documented production default for hybrid retrieval.
108+
109+
### 4. A managed vector database (Pinecone / Weaviate)
110+
Rejected. pgvector keeps the vectors in the same Postgres that already
111+
holds `cached_jobs` — one datastore, one backup, one access path, the
112+
same `service_role` RPC posture. A separate vector service adds infra,
113+
cost, and a second consistency problem for no gain at this scale.
114+
115+
### 5. IVFFlat index instead of HNSW
116+
Rejected. IVFFlat needs a training pass over a populated table and
117+
re-tuning as the corpus grows. HNSW is correct immediately — which
118+
matters because the `embedding` column is backfilled *after* the index
119+
is created.
120+
121+
## Consequences
122+
123+
### Positive
124+
125+
- Recall improves on both failure modes — abbreviations match their
126+
long forms, and conceptually related jobs surface even with zero
127+
shared tokens.
128+
- Graceful degradation is structural: hybrid is one flag, and every
129+
Tier 2 failure path falls back to the proven Tier 1 lexical RPC.
130+
- The vector layer is pure Postgres — no new infrastructure, no second
131+
datastore.
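
The fallback behaviour can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the code in `src/cached_jobs_store.py`: the `search_jobs` name, the injected `embed`, `hybrid_rpc`, and `lexical_rpc` callables, and the accepted truthy flag values are all hypothetical. Only the flag name and the "any failure falls back to lexical" rule come from the ADR.

```python
import os


def search_jobs(query, *, embed, hybrid_rpc, lexical_rpc, env=None):
    """Try the hybrid path when the flag is on; otherwise, or on any
    failure, return the lexical result instead."""
    env = os.environ if env is None else env
    # Assumption: the flag is read as a truthy string; the real parsing may differ.
    enabled = env.get("JOB_SEARCH_HYBRID_ENABLED", "").lower() in {"1", "true", "yes"}
    if enabled:
        try:
            vector = embed(query)             # may raise: missing key, API error
            return hybrid_rpc(query, vector)  # may raise: RPC error or timeout
        except Exception:
            pass  # deliberately broad: every Tier 2 failure degrades to lexical
    return lexical_rpc(query)


def failing_embed(query):
    raise RuntimeError("embedding unavailable")


result = search_jobs(
    "ml engineer",
    embed=failing_embed,
    hybrid_rpc=lambda q, v: ["hybrid"],
    lexical_rpc=lambda q: ["lexical"],
    env={"JOB_SEARCH_HYBRID_ENABLED": "true"},
)
print(result)
# ['lexical']
```

Passing the callables in keeps the sketch self-contained. It also shows why the lexical RPC stays part of the contract surface: it is both the flag-off path and the error path.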
132+
133+
### Negative
134+
135+
- An OpenAI embedding cost: a one-time corpus backfill (the
136+
\$0.25–0.50 range estimated in DEVLOG Day 70) plus embed-on-write
137+
for new jobs each refresh (cents/day). Small, but the refresh path
138+
is no longer strictly \$0 — see `deployment.md`.
139+
- The hybrid path adds a query-embedding round trip (~200–500 ms), and
140+
the HNSW index adds write cost to the refresh upserts. The Day 75
141+
incident (re-embedding the whole corpus every refresh churned the
142+
index and timed the refresh out) is the cautionary tale of that
143+
write cost; the fix bounds embed-on-write to genuinely new rows.
144+
- Search logic now spans Python (synonym expansion, query embedding,
145+
fallback orchestration) and two SQL RPCs. The Tier 1 RPC is retained
146+
as both the fallback and the hybrid-disabled path, so the contract
147+
surface is two RPCs, not one.
148+
149+
### Neutral
150+
151+
- `JOB_SEARCH_HYBRID_ENABLED` is the operational switch — Tier 2 can
152+
be turned off without a deploy if the semantic side ever misbehaves.
153+
- The hybrid RPC was revised once post-launch: v1 ranked both sides
154+
with window functions over the full corpus and hit the statement
155+
timeout; v2 uses HNSW / GIN candidate pools (DEVLOG Day 74).
156+
157+
## References
158+
159+
- [ADR-013](ADR-013-cached-jobs-cache-layer-with-scheduled-refresh.md)
160+
— the `cached_jobs` cache layer this search reads from.
161+
- [ADR-014](ADR-014-postgres-rpc-for-ranked-search.md) — the Tier 1
162+
lexical `search_cached_jobs_ranked` RPC this extends and falls back
163+
to.
164+
- DEVLOG Days 68 (Tier 1), 70 (Tier 2), 74 (hybrid RPC rewrite), 75
165+
(embed-on-write fix).
166+
- SQL: `docs/sql/supabase-cached-jobs-search.sql` (Tier 1),
167+
`supabase-cached-jobs-pgvector.sql` (embedding column + HNSW index),
168+
`supabase-cached-jobs-hybrid.sql` (hybrid RPC).
169+
- Code: `src/job_search_synonyms.py`, `src/cached_jobs_store.py`,
170+
`scripts/backfill_job_embeddings.py`.

docs/adr/README.md

Lines changed: 2 additions & 1 deletion
@@ -19,6 +19,7 @@ The accepted set is grouped into four thematic clusters to make the current prod
1919
- [ADR-012: Next.js workspace and FastAPI runtime baseline](ADR-012-nextjs-workspace-and-fastapi-runtime-baseline.md)
2020
- [ADR-013: Cached jobs cache layer with scheduled refresh](ADR-013-cached-jobs-cache-layer-with-scheduled-refresh.md)
2121
- [ADR-014: Postgres RPC for ranked job search](ADR-014-postgres-rpc-for-ranked-search.md)
22+
- [ADR-033: Hybrid lexical + semantic job search](ADR-033-hybrid-job-search.md) — extends ADR-014's lexical RPC to a three-tier relevance design: deterministic synonym / abbreviation query expansion (Tier 1) + a pgvector semantic retriever fused with the lexical ranking via Reciprocal Rank Fusion (Tier 2, `search_cached_jobs_hybrid`), gated behind `JOB_SEARCH_HYBRID_ENABLED` with graceful fallback to lexical. Tier 3 (learned ranker) out of scope pre-revenue
2223
- [ADR-015: DOCX-first artifact export with theme palette](ADR-015-docx-first-artifact-export-with-theme-palette.md)
2324
- [ADR-029: ThemeSpec single-source + color-theme expansion](ADR-029-themespec-single-source-and-color-theme-expansion.md) — realises ADR-015's typed-`ThemeSpec` follow-up; one registry derives résumé + cover-letter + DOCX palettes + the backend gate set; first new theme `modern_blue` (single-column, ATS-safe). Two-column "presentation" layout reserved for a later gated phase — since realised by ADR-032
2425
- [ADR-032: Six bespoke two-column résumé themes](ADR-032-two-column-resume-themes.md) — extends ADR-029's `ThemeSpec` registry with six distinct two-column résumé layouts via a `twocol_layout` discriminator, retiring the single placeholder `presentation_twocol` slot; the registry now ships 12 themes (6 single-column ATS-safe + 6 two-column)
@@ -57,7 +58,7 @@ The accepted set is grouped into four thematic clusters to make the current prod
5758

5859
## Current state note
5960

60-
As of 2026-05-22, the shipped product is a Next.js workspace deployed on Vercel backed by a FastAPI container on a Frankfurt VPS, with a Supabase EU project for Auth + persistence + the cached-jobs index. The agentic workflow runs Tailoring → Review → ResumeGen → CoverLetter on every analysis, with per-agent retry + fallback isolation. Tier enforcement is live across eight counters (Free / Pro / Business) with the Lemon Squeezy payment scaffold env-gated behind a "Coming soon" frontend fallback until the dashboard's final variant IDs land; export format/theme is additionally an entitlement gate (Free = PDF + `professional_neutral`, which is also the product-wide default theme; Pro/Business unlock DOCX + every non-default theme) enforced server-side via the same 429 upgrade path (ADR-027). Theming is one typed `ThemeSpec` registry (ADR-029, ADR-032) deriving the résumé, cover-letter, and DOCX palettes plus the backend supported-theme set; it ships 12 themes — 6 single-column (ATS-safe) and 6 two-column — and its `layout` / `twocol_layout` discriminators drive the renderer, so adding a theme is one registry entry. The observability stack (Sentry `jobagent-backend` + `jobagent-frontend` + a shared PostHog free-tier project tagged with `product: "jobagent"`) is wired with a custom EU cookie consent banner gating PostHog + Sentry Session Replay behind explicit user opt-in; consent persists in a parent-domain cookie so it is honoured across the marketing apex and the `app.` subdomain. `backend/nightly_eval.py` exists and is tested but is **not** on the production cron at pre-revenue stage — re-enabling is a single crontab edit when revenue justifies the recurring LLM spend.
61+
As of 2026-05-22, the shipped product is a Next.js workspace deployed on Vercel backed by a FastAPI container on a Frankfurt VPS, with a Supabase EU project for Auth + persistence + the cached-jobs index. The agentic workflow runs Tailoring → Review → ResumeGen → CoverLetter on every analysis, with per-agent retry + fallback isolation. Tier enforcement is live across eight counters (Free / Pro / Business) with the Lemon Squeezy payment scaffold env-gated behind a "Coming soon" frontend fallback until the dashboard's final variant IDs land; export format/theme is additionally an entitlement gate (Free = PDF + `professional_neutral`, which is also the product-wide default theme; Pro/Business unlock DOCX + every non-default theme) enforced server-side via the same 429 upgrade path (ADR-027). Theming is one typed `ThemeSpec` registry (ADR-029, ADR-032) deriving the résumé, cover-letter, and DOCX palettes plus the backend supported-theme set; it ships 12 themes — 6 single-column (ATS-safe) and 6 two-column — and its `layout` / `twocol_layout` discriminators drive the renderer, so adding a theme is one registry entry. Job search over the `cached_jobs` index is hybrid (ADR-033) — deterministic synonym expansion plus a pgvector semantic ranking fused with the lexical full-text ranking via Reciprocal Rank Fusion, gated behind `JOB_SEARCH_HYBRID_ENABLED` and degrading to lexical-only on any failure. The observability stack (Sentry `jobagent-backend` + `jobagent-frontend` + a shared PostHog free-tier project tagged with `product: "jobagent"`) is wired with a custom EU cookie consent banner gating PostHog + Sentry Session Replay behind explicit user opt-in; consent persists in a parent-domain cookie so it is honoured across the marketing apex and the `app.` subdomain. `backend/nightly_eval.py` exists and is tested but is **not** on the production cron at pre-revenue stage — re-enabling is a single crontab edit when revenue justifies the recurring LLM spend.
6162

6263
## Adding a new ADR
6364

docs/architecture.md

Lines changed: 1 addition & 1 deletion
@@ -215,7 +215,7 @@ Each `saved_jobs` row stores one shortlisted posting per user and normalized job
215215

216216
Each `resume_builder_sessions` row stores one in-progress conversational resume-builder draft per user with a 7-day TTL refreshed on every save. A `pg_cron` job (`cleanup-expired-resume-builder-sessions`) hard-deletes expired rows every 5 min and RLS hides expired rows from per-user queries; see [ADR-016](adr/ADR-016-conversational-llm-resume-builder.md).
217217

218-
Each `cached_jobs` row holds one upstream posting keyed on `(source, job_id)`. The table has GENERATED STORED columns (`work_mode`, `employment_type_norm`) backing the dropdown filters, `removed_at` tombstones for upstream-closed jobs the user has bookmarked, and an `embedding vector(1536)` column (pgvector, HNSW cosine index) for semantic search. A `pg_cron` + `pg_net` schedule (`cached_jobs_refresh_4h`) POSTs to `/admin/refresh-cache` every 4 hours, six times a day (see `docs/sql/job_cache_cron_setup.sql` for the template — production runs `0 */4 * * *`). Search is two-tier: the lexical `search_cached_jobs_ranked` RPC ([ADR-014](adr/ADR-014-postgres-rpc-for-ranked-search.md)) and the hybrid `search_cached_jobs_hybrid` RPC, which fuses that lexical ranking with a pgvector semantic ranking via Reciprocal Rank Fusion. The hybrid path is gated behind the `JOB_SEARCH_HYBRID_ENABLED` flag and degrades to lexical on any failure.
218+
Each `cached_jobs` row holds one upstream posting keyed on `(source, job_id)`. The table has GENERATED STORED columns (`work_mode`, `employment_type_norm`) backing the dropdown filters, `removed_at` tombstones for upstream-closed jobs the user has bookmarked, and an `embedding vector(1536)` column (pgvector, HNSW cosine index) for semantic search. A `pg_cron` + `pg_net` schedule (`cached_jobs_refresh_4h`) POSTs to `/admin/refresh-cache` every 4 hours, six times a day (see `docs/sql/job_cache_cron_setup.sql` for the template — production runs `0 */4 * * *`). Search is two-tier: the lexical `search_cached_jobs_ranked` RPC ([ADR-014](adr/ADR-014-postgres-rpc-for-ranked-search.md)) and the hybrid `search_cached_jobs_hybrid` RPC, which fuses that lexical ranking with a pgvector semantic ranking via Reciprocal Rank Fusion. The hybrid path is gated behind the `JOB_SEARCH_HYBRID_ENABLED` flag and degrades to lexical on any failure; see [ADR-033](adr/ADR-033-hybrid-job-search.md).
219219

220220
`aijobagent_run_traces` is an append-only cost-attribution table — one row per successful LLM call (`user_id`, `model`, `task`, `prompt_tokens`, `completion_tokens`, `cost_usd`, `created_at`). Writes are best-effort: a missing table or a write error never propagates to the user-facing path. It is the canonical answer to "what is OpenAI spend doing", separate from the Sentry/PostHog telemetry surface.
221221
