Tier 2: rewrite hybrid search RPC — HNSW candidate pools

LEANDERANTONY · claude · LEANDERANTONY · commit ad4c3bda1bd9 · 2026-05-22T01:22:54.000+05:30
The v1 search_cached_jobs_hybrid timed out against the live ~14k-row
corpus: it ranked both the lexical and semantic sides with window
functions over one shared candidates CTE, so the semantic
row_number() had to sort every embedding with no usable index — the
HNSW index was never touched. v2 makes each retriever its own
ORDER BY ... LIMIT 200 query on cached_jobs, so the GIN / HNSW
indexes drive candidate selection; RRF fusion is unchanged. Syncs
the repo SQL file to the live function and adds DEVLOG Day 74
(embedding backfill completion + the rewrite).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
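
The RRF fusion step that this commit leaves unchanged can be illustrated outside the database. The Python below is a minimal sketch, not the production code: `rrf_fuse` is a hypothetical helper name, the inputs are assumed to be id lists already ordered best-first (as the two `LIMIT 200` pools are), and ties are broken by id only, whereas the SQL function also uses `posted_at`.

```python
RRF_K = 60  # same damping constant as rrf_k in the SQL function


def rrf_fuse(lexical_ids, semantic_ids, k=RRF_K):
    """Fuse two ranked id lists (best first) with Reciprocal Rank Fusion.

    An id that appears in only one list contributes only that list's term,
    which mirrors the FULL OUTER JOIN plus COALESCE(..., 0.0) in the RPC.
    """
    scores = {}
    for ranked in (lexical_ids, semantic_ids):
        # Ranks are 1-based, like row_number() in the SQL.
        for rank, job_id in enumerate(ranked, start=1):
            scores[job_id] = scores.get(job_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; id ascending as a simplified tie-break.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for job_id, score in rrf_fuse(["a", "b", "c"], ["c", "a", "d"]):
        print(job_id, round(score, 6))
```

In this example "a" (ranks 1 and 2) edges out "c" (ranks 3 and 1), and both outrank "b" and "d", which appear in one list only. That is the behaviour the header comment describes: a job ranked highly by either signal surfaces, and agreement between the two signals is rewarded.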
diff --git a/docs/DEVLOG.md b/docs/DEVLOG.md
@@ -3728,3 +3728,65 @@ populated local `.env`, two unrelated tests
 fail because the dev `.env` sets `JOB_SEARCH_HYBRID_ENABLED=true`
 and `SUPABASE_SERVICE_ROLE_KEY`; both are green on CI (which ships
 no `.env`) and out of scope for this fix.
+
+## Day 74: Tier 2 hybrid search — backfill + the v1→v2 RPC rewrite
+
+Day 70 shipped Tier 2 as code + SQL files with the production
+rollout left as operator actions. This day is that rollout: the
+pgvector schema applied, the embedding corpus backfilled, the
+hybrid RPC turned on — and the RPC rewritten when its first cut
+buckled against the real corpus.
+
+**Backfill — 14,319 / 14,319 embedded.** `backfill_job_embeddings.py`
+processes at most 1000 rows per invocation (the PostgREST row cap),
+so it ran in a loop until `embedding IS NULL` returned zero: 16
+invocations, the last a no-op confirming nothing was left. Two
+single-row embedding failures along the way (passes 8 and 14, one
+row each, batch otherwise clean) self-healed — the `embedding IS
+NULL` filter makes the script resumable, so a later pass re-picked
+them. One-time cost landed within the $0.25–0.50 estimate. Final
+state: every cached job carries a `text-embedding-3-small` vector.
+
+**The v1 RPC timed out.** Smoke-testing `search_cached_jobs_hybrid`
+against the full corpus returned `57014: canceling statement due to
+statement timeout`. Root cause: v1 computed both ranks with window
+functions over one shared `candidates` CTE containing every
+filtered row. The semantic `row_number() OVER (ORDER BY embedding
+<=> ...)` therefore had to compute cosine distance for all ~14k
+rows and sort them — a window-function ORDER BY cannot use an
+index, so the HNSW index built in Phase A was never touched. Pure
+sequential distance math over the whole corpus, every call.
+
+**v2 — HNSW candidate pools.** The rewrite makes each retriever its
+own top-N query reading `cached_jobs` directly: `ORDER BY
+ts_rank(...) DESC LIMIT 200` for lexical (GIN index) and `ORDER BY
+embedding <=> p_query_embedding LIMIT 200` for semantic (HNSW
+index). HNSW serves the `ORDER BY <=> ... LIMIT` directly and GIN
+serves the lexical `@@` filter, so both indexes are back in the plan.
+RRF fusion, the FULL OUTER JOIN, the sort modes, and the
+service_role-only grants are unchanged; a `NOT has_query AND NOT
+has_embedding` early branch handles browse mode without building
+either pool. EXPLAIN ANALYZE confirms `Index Scan using
+cached_jobs_embedding_hnsw_idx`; the call that timed out now
+returns in well under a second.
+
+**hnsw.ef_search stays at the default (40).** The semantic `LIMIT
+200` is bounded by `ef_search`, so the real semantic pool is at most
+~40 rows. Widening it was tried and dropped. A function-level `SET
+hnsw.ef_search` clause fails at CREATE time with `42501 permission
+denied to set parameter` — the migration role lacks the privilege.
+A body-level `SET LOCAL` fails at call time with `0A000 SET is not
+allowed in a non-volatile function`: the RPC is STABLE, and only
+VOLATILE functions may run SET. And the payoff would not justify it
+anyway — EXPLAIN ANALYZE at `ef_search = 100` measured ~650ms for
+the semantic scan against ~190ms at the default, roughly 3x the
+latency for a deeper pool a 20–50 row page does not need. 40
+semantic candidates fused against 200 lexical is ample. An operator
+who ever wants deeper recall can widen it globally with `ALTER
+DATABASE postgres SET hnsw.ef_search`.
+
+Verification: all three code paths return a full page against the
+live corpus — fusion (query + embedding), pure-lexical (query, no
+embedding), and browse (neither) — with no timeout.
+`docs/sql/supabase-cached-jobs-hybrid.sql` now mirrors the live
+function exactly.
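
The resumable backfill loop described in the Day 74 entry can be sketched in Python. This is an illustration under stated assumptions, not the contents of `backfill_job_embeddings.py`: the rows live in an in-memory list rather than behind PostgREST, and `run_pass`, `backfill`, and `make_flaky_embedder` are hypothetical names introduced here. It shows why a row that fails once is picked up again: each pass re-selects whatever still has no embedding.

```python
BATCH_CAP = 1000  # per-invocation row cap, as described in the DEVLOG


def run_pass(rows, embed, cap=BATCH_CAP):
    """One invocation: try to embed up to `cap` rows that have no embedding.

    Returns how many rows were selected (0 means nothing was left).
    """
    pending = [r for r in rows if r["embedding"] is None][:cap]
    for row in pending:
        try:
            row["embedding"] = embed(row)
        except RuntimeError:
            # Leave the embedding as None so a later pass re-selects the row.
            pass
    return len(pending)


def backfill(rows, embed, max_passes=100):
    """Repeat passes until one selects nothing; return the number of passes."""
    for passes in range(1, max_passes + 1):
        if run_pass(rows, embed) == 0:
            return passes
    raise RuntimeError("backfill did not converge within max_passes")


def make_flaky_embedder(fail_once_ids):
    """Hypothetical embedder that fails the first time for the given ids."""
    remaining = set(fail_once_ids)

    def embed(row):
        if row["id"] in remaining:
            remaining.discard(row["id"])
            raise RuntimeError("simulated transient embedding failure")
        return [0.0]  # placeholder vector, not a real embedding

    return embed
```

With 14,319 rows, a cap of 1000, and two rows that each fail once, this sketch needs 15 working passes plus one final pass that selects nothing, which matches the 16 invocations the entry reports. The `max_passes` guard is an addition for the sketch so that a permanently failing row cannot loop forever.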
diff --git a/docs/sql/supabase-cached-jobs-hybrid.sql b/docs/sql/supabase-cached-jobs-hybrid.sql
@@ -14,7 +14,7 @@
 -- two RANKINGS (not their raw scores — which live on incomparable scales)
 -- so a job ranked highly by EITHER signal surfaces.
 --
--- PREREQUISITES (operator must do these first — see DEVLOG Day 69
+-- PREREQUISITES (operator must do these first — see DEVLOG Day 70
 -- "OPERATOR ACTION REQUIRED"):
 --   1. `docs/sql/supabase-cached-jobs-pgvector.sql` applied — the
 --      `vector` extension, `cached_jobs.embedding vector(1536)` column,
@@ -25,6 +25,29 @@
 -- This file (the RPC) is applied AFTER those two steps, then the operator
 -- flips JOB_SEARCH_HYBRID_ENABLED=true.
 --
+-- ARCHITECTURE — HNSW CANDIDATE POOLS (the v2 rewrite, DEVLOG Day 74):
+-- Each retriever is its OWN top-N query reading `cached_jobs` directly,
+-- so the index can drive candidate selection:
+--   - lexical  — GIN on search_tsv; `ORDER BY ts_rank DESC LIMIT 200`.
+--   - semantic — HNSW vector_cosine_ops on embedding; `ORDER BY <=>
+--                LIMIT 200`.
+-- The first cut ranked BOTH lists with window functions over one shared
+-- `candidates` CTE of every filtered row; the semantic `row_number()`
+-- then had to sort all ~14k embeddings with no usable index and hit the
+-- statement timeout against the real corpus. Per-retriever top-N queries
+-- keep the HNSW/GIN index in the plan — verified via EXPLAIN ANALYZE.
+--
+-- NOTE ON hnsw.ef_search: left at the pgvector default (40), so the HNSW
+-- scan yields at most ~40 semantic candidates despite the `LIMIT 200` above.
+-- That is ample fused against 200 lexical candidates for a 20-50 row page.
+-- It is deliberately NOT widened in-function: a function-level `SET`
+-- clause is rejected (`42501 permission denied to set parameter` — the
+-- migration role lacks the privilege) and a body-level `SET LOCAL` is
+-- rejected too (`0A000 SET is not allowed in a non-volatile function` —
+-- this RPC is STABLE). An operator with privilege can widen recall
+-- globally via `ALTER DATABASE postgres SET hnsw.ef_search = <n>` if the
+-- semantic pool ever needs to be deeper.
+--
 -- PARAMS: identical to `search_cached_jobs_ranked` (p_query, p_location,
 -- p_sources, p_remote_only, p_posted_within_days, p_limit, p_work_modes,
 -- p_employment_types, p_sort_by, p_offset) PLUS:
@@ -36,7 +59,7 @@
 -- syntax string already built by the backend's synonym expander
 -- (src/job_search_synonyms.py), NOT raw user text. Parsed with
 -- `to_tsquery` (not websearch_to_tsquery) for the same reason. Empty
--- p_query = no lexical filter.
+-- p_query = no lexical retriever.
 --
 -- RRF FUSION: for each job, with k = 60 (the standard RRF constant):
 --     rrf = 1.0/(k + lexical_rank) + 1.0/(k + semantic_rank)
@@ -45,16 +68,13 @@
 --     have an embedding.
 --   - A job present in only ONE ranked list contributes ONLY that list's
 --     term (the other term is 0) — implemented via a FULL OUTER JOIN and
---     treating a missing rank's term as 0.
+--     COALESCE(..., 0.0) on the missing term.
 --
 -- DEGENERATE CASES (all must still return sensible rows):
---   - empty p_query                -> lexical list is "all filtered rows"
---                                     unranked-by-text; effectively the
---                                     semantic ranking (or recency)
---                                     drives ordering. No to_tsquery call.
---   - NULL p_query_embedding       -> semantic list empty; pure lexical.
---   - both empty/NULL              -> filtered rows ordered by recency
---                                     (same as browse mode).
+--   - query present, NULL embedding -> semantic pool empty; pure lexical.
+--   - empty query, embedding present -> lexical pool empty; pure semantic.
+--   - both empty/NULL -> browse mode: an early-return branch lists the
+--     filtered rows ordered by recency (no retriever / fusion needed).
 --
 -- SORTING: same p_sort_by modes as the Tier 1 RPC. 'relevance' (default)
 -- orders by the fused RRF score DESC; 'newest'/'oldest'/'company_az'
@@ -100,7 +120,7 @@ DECLARE
     -- RRF damping constant. 60 is the value from the original RRF paper
     -- and the de-facto production default; larger k flattens the
     -- contribution curve, smaller k sharpens it toward rank-1.
-    rrf_k         CONSTANT integer := 60;
+    rrf_k    CONSTANT integer := 60;
 BEGIN
     IF has_query THEN
         -- Same contract as the Tier 1 RPC: p_query is a to_tsquery-
@@ -109,126 +129,125 @@ BEGIN
         tsquery_obj := to_tsquery('english', p_query);
     END IF;
 
-    RETURN QUERY
     -- ----------------------------------------------------------------
-    -- candidates: every row that survives the (non-text, non-vector)
-    -- FILTERS. Both ranked lists below draw from this same pool, so a
-    -- job's rank reflects its standing among ELIGIBLE jobs only.
+    -- Browse mode (no text query, no query embedding): there is no
+    -- ranking signal, so skip the retrievers / fusion entirely and
+    -- return the filtered rows ordered by recency.
     -- ----------------------------------------------------------------
-    WITH candidates AS (
+    IF NOT has_query AND NOT has_embedding THEN
+        RETURN QUERY
         SELECT cj.*
         FROM public.cached_jobs cj
         WHERE cj.removed_at IS NULL
-          AND (
-              COALESCE(p_location, '') = ''
-              OR cj.location ILIKE '%' || p_location || '%'
-          )
-          AND (
-              p_sources IS NULL
-              OR cardinality(p_sources) = 0
-              OR cj.source = ANY (p_sources)
-          )
+          AND (COALESCE(p_location, '') = '' OR cj.location ILIKE '%' || p_location || '%')
+          AND (p_sources IS NULL OR cardinality(p_sources) = 0 OR cj.source = ANY (p_sources))
           AND (NOT p_remote_only OR cj.work_mode = 'remote')
-          AND (
-              p_work_modes IS NULL
-              OR cardinality(p_work_modes) = 0
-              OR cj.work_mode = ANY (p_work_modes)
-          )
-          AND (
-              p_employment_types IS NULL
-              OR cardinality(p_employment_types) = 0
-              OR cj.employment_type_norm = ANY (p_employment_types)
-          )
-          AND (
-              p_posted_within_days IS NULL
-              OR cj.posted_at > NOW() - (p_posted_within_days || ' days')::INTERVAL
-          )
-    ),
+          AND (p_work_modes IS NULL OR cardinality(p_work_modes) = 0 OR cj.work_mode = ANY (p_work_modes))
+          AND (p_employment_types IS NULL OR cardinality(p_employment_types) = 0 OR cj.employment_type_norm = ANY (p_employment_types))
+          AND (p_posted_within_days IS NULL OR cj.posted_at > NOW() - (p_posted_within_days || ' days')::INTERVAL)
+        ORDER BY
+            CASE sort_mode WHEN 'company_az' THEN NULL  -- company key below carries
+                WHEN 'oldest' THEN -extract(epoch from cj.posted_at)
+                ELSE extract(epoch from cj.posted_at)
+            END DESC NULLS LAST,
+            CASE sort_mode WHEN 'company_az' THEN LOWER(cj.company) ELSE NULL END ASC NULLS LAST,
+            cj.posted_at DESC NULLS LAST,
+            cj.id DESC
+        LIMIT GREATEST(1, LEAST(COALESCE(p_limit, 20), 50))
+        OFFSET GREATEST(0, COALESCE(p_offset, 0));
+        RETURN;
+    END IF;
+
     -- ----------------------------------------------------------------
-    -- lexical: 1-based rank by ts_rank DESC among rows whose tsvector
-    -- matches the query. When p_query is empty there is no FTS filter
-    -- and no meaningful text rank — every candidate is included with a
-    -- NULL lexical_rank so it contributes 0 to the lexical RRF term
-    -- (its placement then comes from the semantic side / recency).
+    -- Hybrid path: at least one of (text query, query embedding) is
+    -- present. Each retriever is its own top-N query on `cached_jobs`
+    -- so the GIN / HNSW index drives candidate selection (see the
+    -- ARCHITECTURE note in the header); RRF then fuses the two
+    -- rankings.
     -- ----------------------------------------------------------------
+    RETURN QUERY
+    WITH
+    -- lexical: top-200 by ts_rank among rows whose tsvector matches the
+    -- query. Empty when has_query is false (-> pure-semantic fallback).
+    -- The GIN index on search_tsv accelerates the `@@` filter; the
+    -- outer row_number() assigns the 1-based lexical rank for RRF.
     lexical AS (
-        SELECT
-            c.id,
-            CASE
-                WHEN has_query
-                THEN row_number() OVER (
-                    ORDER BY ts_rank(c.search_tsv, tsquery_obj) DESC,
-                             c.posted_at DESC NULLS LAST,
-                             c.id
-                )
-                ELSE NULL::bigint
-            END AS lexical_rank
-        FROM candidates c
-        WHERE NOT has_query OR c.search_tsv @@ tsquery_obj
+        SELECT l.id,
+               row_number() OVER (
+                   ORDER BY l.rank_score DESC, l.posted_at DESC NULLS LAST, l.id
+               ) AS lexical_rank
+        FROM (
+            SELECT cj.id, cj.posted_at,
+                   ts_rank(cj.search_tsv, tsquery_obj) AS rank_score
+            FROM public.cached_jobs cj
+            WHERE has_query
+              AND cj.removed_at IS NULL
+              AND cj.search_tsv @@ tsquery_obj
+              AND (COALESCE(p_location, '') = '' OR cj.location ILIKE '%' || p_location || '%')
+              AND (p_sources IS NULL OR cardinality(p_sources) = 0 OR cj.source = ANY (p_sources))
+              AND (NOT p_remote_only OR cj.work_mode = 'remote')
+              AND (p_work_modes IS NULL OR cardinality(p_work_modes) = 0 OR cj.work_mode = ANY (p_work_modes))
+              AND (p_employment_types IS NULL OR cardinality(p_employment_types) = 0 OR cj.employment_type_norm = ANY (p_employment_types))
+              AND (p_posted_within_days IS NULL OR cj.posted_at > NOW() - (p_posted_within_days || ' days')::INTERVAL)
+            ORDER BY ts_rank(cj.search_tsv, tsquery_obj) DESC
+            LIMIT 200
+        ) l
     ),
-    -- ----------------------------------------------------------------
-    -- semantic: 1-based rank by cosine distance ASC among candidates
-    -- that HAVE an embedding. Empty when p_query_embedding is NULL ->
-    -- pure-lexical fallback. `<=>` is pgvector cosine distance; the
-    -- HNSW vector_cosine_ops index from the pgvector schema file
-    -- accelerates this ordering.
-    -- ----------------------------------------------------------------
+    -- semantic: top-200 by cosine distance among rows that HAVE an
+    -- embedding. Empty when p_query_embedding is NULL (-> pure-lexical
+    -- fallback). `<=>` is pgvector cosine distance; the HNSW
+    -- vector_cosine_ops index serves this ORDER BY ... LIMIT directly.
     semantic AS (
-        SELECT
-            c.id,
-            row_number() OVER (
-                ORDER BY c.embedding <=> p_query_embedding,
-                         c.posted_at DESC NULLS LAST,
-                         c.id
-            ) AS semantic_rank
-        FROM candidates c
-        WHERE has_embedding
-          AND c.embedding IS NOT NULL
+        SELECT s.id,
+               row_number() OVER (
+                   ORDER BY s.dist ASC, s.posted_at DESC NULLS LAST, s.id
+               ) AS semantic_rank
+        FROM (
+            SELECT cj.id, cj.posted_at,
+                   cj.embedding <=> p_query_embedding AS dist
+            FROM public.cached_jobs cj
+            WHERE has_embedding
+              AND cj.embedding IS NOT NULL
+              AND cj.removed_at IS NULL
+              AND (COALESCE(p_location, '') = '' OR cj.location ILIKE '%' || p_location || '%')
+              AND (p_sources IS NULL OR cardinality(p_sources) = 0 OR cj.source = ANY (p_sources))
+              AND (NOT p_remote_only OR cj.work_mode = 'remote')
+              AND (p_work_modes IS NULL OR cardinality(p_work_modes) = 0 OR cj.work_mode = ANY (p_work_modes))
+              AND (p_employment_types IS NULL OR cardinality(p_employment_types) = 0 OR cj.employment_type_norm = ANY (p_employment_types))
+              AND (p_posted_within_days IS NULL OR cj.posted_at > NOW() - (p_posted_within_days || ' days')::INTERVAL)
+            ORDER BY cj.embedding <=> p_query_embedding
+            LIMIT 200
+        ) s
     ),
-    -- ----------------------------------------------------------------
     -- fused: FULL OUTER JOIN the two ranked lists so a job appearing in
     -- only one still survives. RRF score sums the per-list terms; a
     -- missing rank contributes 0 (COALESCE(..., 0.0)).
-    -- ----------------------------------------------------------------
     fused AS (
         SELECT
             COALESCE(l.id, s.id) AS id,
-            COALESCE(
-                CASE WHEN l.lexical_rank IS NOT NULL
-                     THEN 1.0 / (rrf_k + l.lexical_rank)
-                     ELSE 0.0 END,
-                0.0
-            )
-            +
-            COALESCE(
-                CASE WHEN s.semantic_rank IS NOT NULL
-                     THEN 1.0 / (rrf_k + s.semantic_rank)
-                     ELSE 0.0 END,
-                0.0
-            ) AS rrf_score
+            COALESCE(1.0 / (rrf_k + l.lexical_rank), 0.0)
+          + COALESCE(1.0 / (rrf_k + s.semantic_rank), 0.0) AS rrf_score
         FROM lexical l
         FULL OUTER JOIN semantic s ON s.id = l.id
     )
-    -- ----------------------------------------------------------------
-    -- Final projection: join the fused scores back to the candidate
-    -- rows (so the SETOF cached_jobs shape is returned) and order by
-    -- the requested sort mode. 'relevance' uses the RRF score.
-    -- ----------------------------------------------------------------
-    SELECT c.*
+    -- Final projection: join the fused scores back to `cached_jobs` (so
+    -- the SETOF cached_jobs shape is returned) and order by the
+    -- requested sort mode. 'relevance' uses the fused RRF score.
+    SELECT cj.*
     FROM fused f
-    JOIN candidates c ON c.id = f.id
+    JOIN public.cached_jobs cj ON cj.id = f.id
     ORDER BY
         CASE sort_mode
-            WHEN 'newest'     THEN extract(epoch from c.posted_at)
-            WHEN 'oldest'     THEN -extract(epoch from c.posted_at)
+            WHEN 'newest'     THEN extract(epoch from cj.posted_at)
+            WHEN 'oldest'     THEN -extract(epoch from cj.posted_at)
             WHEN 'company_az' THEN NULL  -- secondary key carries
             ELSE f.rrf_score  -- 'relevance' (default): fused RRF score
         END DESC NULLS LAST,
         -- Stable secondary keys to break ties — mirrors the Tier 1 RPC.
-        CASE sort_mode WHEN 'company_az' THEN LOWER(c.company) ELSE NULL END
+        CASE sort_mode WHEN 'company_az' THEN LOWER(cj.company) ELSE NULL END
             ASC NULLS LAST,
-        c.posted_at DESC NULLS LAST,
-        c.id DESC
+        cj.posted_at DESC NULLS LAST,
+        cj.id DESC
     LIMIT GREATEST(1, LEAST(COALESCE(p_limit, 20), 50))
     OFFSET GREATEST(0, COALESCE(p_offset, 0));
 END;