Tier 2: embed-on-write only embeds new jobs — fix HNSW churn

LEANDERANTONY · claude · LEANDERANTONY · commit 46cf92be2d7d · 2026-05-22T02:02:40.000+05:30
JOB_SEARCH_HYBRID_ENABLED=true enabled embed-on-write, which
re-embedded and re-wrote the embedding vector for every job on
every 4-hour refresh. The HNSW index churn blew the upsert
statement timeout (57014), failed the refresh in a loop, and the
resulting dead-tuple autovacuum saturated the DB enough to time
out /api/jobs/search itself.

upsert_postings now embeds only postings that are new to the
cache or still have a NULL embedding (found via
_already_embedded_job_ids, two cheap indexed reads) and splits
the upsert into a no-embedding batch + an embedded batch, so
unchanged rows never touch the HNSW index. Embed-on-write keeps
its non-fatal contract. Adds DEVLOG Day 75; pins two store tests
that were silently env-dependent on JOB_SEARCH_HYBRID_ENABLED.
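
A minimal, database-free sketch of the diff-and-split idea described
above, for illustration only. The helper name split_for_upsert, the
fake_embed callable, and the sample job ids are hypothetical and do not
appear in the change. The real code reads the two id sets from
cached_jobs and calls the embeddings service instead.

```python
from typing import Callable


def split_for_upsert(
    rows: list[dict],
    already_embedded: set[str],
    embed: Callable[[list[dict]], list[list[float]]],
) -> tuple[list[dict], list[dict]]:
    """Return (without_embedding, with_embedding) batches for one chunk."""
    # Only rows that are not already embedded need a vector.
    new_rows = [row for row in rows if row["job_id"] not in already_embedded]
    if new_rows:
        try:
            vectors = embed(new_rows)
        except Exception:
            # Non-fatal: the rows are still cached, just without a vector.
            vectors = []
        # Attach only when every new row gets exactly one vector.
        if len(vectors) == len(new_rows):
            for row, vector in zip(new_rows, vectors):
                row["embedding"] = vector
    # Split by column shape so each upsert has a homogeneous column set.
    without_embedding = [row for row in rows if "embedding" not in row]
    with_embedding = [row for row in rows if "embedding" in row]
    return without_embedding, with_embedding


# Hypothetical state: a, b, c exist in the cache; c has a NULL embedding.
existing = {"a", "b", "c"}
null_embedding = {"c"}
already_embedded = existing - null_embedding  # {"a", "b"}

rows = [{"job_id": jid, "title": jid.upper()} for jid in ("a", "b", "c", "d")]
fake_embed = lambda batch: [[0.0, 1.0] for _ in batch]  # stand-in embedder
without_embedding, with_embedding = split_for_upsert(
    rows, already_embedded, fake_embed
)
```

In this sketch rows a and b land in the no-embedding batch, while c
(retried after a NULL embedding) and d (new) land in the embedded
batch, so only those two would write to the vector index.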

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/DEVLOG.md b/docs/DEVLOG.md
@@ -3790,3 +3790,55 @@ live corpus — fusion (query + embedding), pure-lexical (query, no
 embedding), and browse (neither) — with no timeout.
 `docs/sql/supabase-cached-jobs-hybrid.sql` now mirrors the live
 function exactly.
+
+## Day 75: Embed-on-write churn — the Tier 2 production incident
+
+Day 74 finalized the hybrid RPC and the flag went live
+(`JOB_SEARCH_HYBRID_ENABLED=true` on the VPS). A production smoke
+check of `/api/jobs/search` then found the endpoint timing out at
+the gateway — job search was effectively down. The cause was not
+the hybrid query; it was **embed-on-write**.
+
+**What broke.** Flipping the flag switched on two things, not one:
+the hybrid *search* path AND embed-on-write in the cache-refresh
+path. `CachedJobsStore.upsert_postings` was embedding and writing
+the `embedding` vector for *every* posting in *every* chunk of
+*every* 4-hour refresh — all ~14k jobs, changed or not. Each
+vector write updates the HNSW index. The refresh upserts in chunks
+of 30, already sized to fit Postgres's statement timeout for the
+`search_tsv` + GIN index work; the extra HNSW churn pushed them
+past it, and the VPS logs filled with `Failed to upsert chunk for
+greenhouse … 57014 statement timeout`.
+
+The cascade: failed upserts retried in a loop → mass dead tuples →
+a multi-minute autovacuum on `cached_jobs` → a saturated database
+→ search queries (the hybrid RPC *and* its lexical fallback) slow
+enough to trip the statement timeout too → the whole request
+exceeding the ~30s gateway timeout. One root cause, two dead
+features: the refresh and search both down.
+
+**The fix — embed only new rows.** `upsert_postings` no longer
+re-embeds the corpus. A new `_embed_new_rows` step embeds only the
+postings that are NEW to the cache (or exist with a NULL embedding
+from a prior failure); `_already_embedded_job_ids` finds the rest
+with two cheap `(source, job_id)`-indexed reads. The upsert is
+then issued as up to two batches — rows without an `embedding` key
+(the bulk; their stored vector is left intact on conflict) and
+rows with a freshly-computed one. Each PostgREST request keeps a
+homogeneous column set, and the HNSW index is written only for the
+handful of genuinely-new jobs per refresh. In steady state a chunk
+has zero new rows → a single text-only upsert, exactly the proven
+pre-Tier-2 cost.
+
+Embed-on-write keeps its non-fatality contract: any failure (the
+diff reads, the embeddings call, a vector-count mismatch) logs and
+degrades to "cache the row without an embedding" — the hybrid RPC
+treats a NULL embedding as lexical-only and the next refresh
+retries it. The fix is contained to `cached_jobs_store.py`; the
+flag stays on, so once deployed the next refresh self-corrects.
+
+Verification: 28 `test_cached_jobs_store` tests green, including a
+new `test_upsert_skips_reembedding_already_embedded_rows`. Two
+store tests that were silently env-dependent — routing through the
+hybrid RPC whenever a local `.env` had the flag on — now pin the
+flag explicitly.
diff --git a/src/cached_jobs_store.py b/src/cached_jobs_store.py
@@ -191,12 +191,17 @@ def upsert_postings(self, source: str, postings: Iterable) -> int:
         matching attributes — `id`, `source`, `title`, `company`, etc.).
         Conflict key is (source, job_id) so re-runs are idempotent.
 
-        Tier 2 embed-on-write: when hybrid search is enabled, each row's
-        `embedding` is computed in the same call so the corpus stays
-        current without a re-run of the backfill. This is STRICTLY
-        non-fatal — if the embeddings call fails, the jobs are still
-        upserted (without an embedding; the hybrid RPC degrades those
-        rows to lexical until the next backfill / refresh fills them).
+        Tier 2 embed-on-write: when hybrid search is enabled, rows that
+        are NEW to the cache (no embedding stored yet) get an
+        `embedding` computed in this same call. Rows already embedded
+        are NOT re-embedded — re-writing every job's vector on every
+        4-hour refresh churned the HNSW index hard enough to blow the
+        upsert statement timeout (DEVLOG Day 75). STRICTLY non-fatal:
+        if the embeddings call fails the jobs are still upserted, just
+        without an embedding. The upsert is issued as up to two batches
+        (rows without an `embedding`, then rows with one) so each
+        PostgREST request has a homogeneous column set and the HNSW
+        index is only touched for the small newly-embedded batch.
         """
         client = self._require_client()
         rows = []
@@ -233,33 +238,48 @@ def upsert_postings(self, source: str, postings: Iterable) -> int:
         if not rows:
             return 0
         # Tier 2 embed-on-write — non-fatal. Attaches an `embedding` to
-        # each row dict in place. Any failure inside leaves the rows
-        # embedding-free and the upsert proceeds regardless.
-        self._attach_embeddings_on_write(rows)
+        # the subset of `rows` that are NEW to the cache; already-
+        # embedded rows are left untouched so the upsert never rewrites
+        # their vector.
+        self._embed_new_rows(source, rows)
+        # Split by column shape: a row dict WITHOUT an `embedding` key
+        # leaves the stored vector intact on conflict; rows WITH a fresh
+        # embedding go in their own batch. Two homogeneous upserts keep
+        # each PostgREST request's column set consistent and confine the
+        # HNSW index writes to the (small) new-rows batch.
+        without_embedding = [row for row in rows if "embedding" not in row]
+        with_embedding = [row for row in rows if "embedding" in row]
+        touched = 0
         try:
-            response = (
-                client.table(self._table)
-                .upsert(rows, on_conflict="source,job_id")
-                .execute()
-            )
+            for batch in (without_embedding, with_embedding):
+                if not batch:
+                    continue
+                response = (
+                    client.table(self._table)
+                    .upsert(batch, on_conflict="source,job_id")
+                    .execute()
+                )
+                touched += len(self._extract_rows(response))
         except Exception as exc:
             raise AppError(
                 "Failed to upsert cached jobs.",
                 details=f"{type(exc).__name__}: {exc}",
             ) from exc
-        return len(self._extract_rows(response))
+        return touched
 
-    def _attach_embeddings_on_write(self, rows: list[dict]) -> None:
-        """Compute + attach an `embedding` to each upsert row in place.
+    def _embed_new_rows(self, source: str, rows: list[dict]) -> None:
+        """Attach an `embedding` to the `rows` that are new to the cache.
 
-        The Tier 2 embed-on-write path. Mutates `rows`: on success each
-        dict gains an `embedding` key (a list[float]); on ANY failure
-        the rows are left untouched and the caller upserts them without
-        embeddings. This method NEVER raises — embed-on-write must not
-        be able to break the refresh worker.
+        The Tier 2 embed-on-write path. Mutates `rows` in place: each
+        NEW row's dict gains an `embedding` key (a list[float]). Rows
+        already carrying a vector are left untouched — re-embedding the
+        whole corpus on every refresh churned the HNSW index and timed
+        out the chunk upserts (DEVLOG Day 75).
 
-        Skipped entirely when hybrid search is disabled (no point
-        spending tokens on a column nothing queries yet).
+        NEVER raises — embed-on-write must not be able to break the
+        refresh worker. Skipped entirely when hybrid search is disabled
+        or no OpenAI service is available (no point spending tokens on a
+        column nothing queries yet).
         """
         if not rows:
             return
@@ -268,14 +288,37 @@ def _attach_embeddings_on_write(self, rows: list[dict]) -> None:
         service = self._get_openai_service()
         if service is None or not service.is_available():
             return
+        # Only NEW / un-embedded jobs need work. A failure to determine
+        # which those are degrades to "embed nothing this chunk" rather
+        # than risking a re-embed of the whole corpus.
+        try:
+            already_embedded = self._already_embedded_job_ids(
+                source, [row["job_id"] for row in rows]
+            )
+        except Exception as exc:  # noqa: BLE001 — embed-on-write is non-fatal
+            log_event(
+                LOGGER,
+                logging.WARNING,
+                "cached_jobs_embed_diff_failed",
+                "Could not determine which jobs still need embeddings; "
+                "skipping embed-on-write for this chunk.",
+                row_count=len(rows),
+                error=f"{type(exc).__name__}: {exc}",
+            )
+            return
+        new_rows = [
+            row for row in rows if row["job_id"] not in already_embedded
+        ]
+        if not new_rows:
+            return
         try:
             inputs = [
                 build_job_embedding_input(
                     title=row.get("title", ""),
                     company=row.get("company", ""),
                     description=row.get("description", ""),
                 )
-                for row in rows
+                for row in new_rows
             ]
             vectors = service.create_embeddings(
                 inputs, task_name="job_embedding_on_write"
@@ -285,28 +328,81 @@ def _attach_embeddings_on_write(self, rows: list[dict]) -> None:
                 LOGGER,
                 logging.WARNING,
                 "cached_jobs_embed_on_write_failed",
-                "Embed-on-write failed; jobs cached without embeddings "
-                "(they fall back to lexical until the next backfill).",
-                row_count=len(rows),
+                "Embed-on-write failed; new jobs cached without "
+                "embeddings (a later refresh re-embeds them).",
+                row_count=len(new_rows),
                 error=f"{type(exc).__name__}: {exc}",
             )
             return
-        if len(vectors) != len(rows):
+        if len(vectors) != len(new_rows):
             # Count mismatch — can't safely pair vectors to rows; skip
             # attaching rather than risk the wrong vector on a job.
             log_event(
                 LOGGER,
                 logging.WARNING,
                 "cached_jobs_embed_on_write_count_mismatch",
                 "Embed-on-write returned a mismatched vector count; "
-                "jobs cached without embeddings.",
-                row_count=len(rows),
+                "new jobs cached without embeddings.",
+                row_count=len(new_rows),
                 vector_count=len(vectors),
             )
             return
-        for row, vector in zip(rows, vectors):
+        for row, vector in zip(new_rows, vectors):
             row["embedding"] = vector
 
+    def _already_embedded_job_ids(
+        self, source: str, job_ids: list[str]
+    ) -> set[str]:
+        """job_ids (within `source`) already in cached_jobs WITH a
+        non-NULL embedding.
+
+        Embed-on-write skips these so the upsert never rewrites a stored
+        vector. Computed with two cheap (source, job_id)-indexed reads:
+        the rows that exist, and (of those) the ones whose embedding is
+        still NULL — the difference is the already-embedded set. A row
+        that exists but has a NULL embedding (e.g. a prior embed-on-
+        write failure) is therefore NOT in the set, so the next refresh
+        retries it.
+
+        Raises on a query failure — the caller treats that as "embed
+        nothing this chunk".
+        """
+        client = self._require_client()
+        ids = [jid for jid in job_ids if jid]
+        if not ids:
+            return set()
+        existing: set[str] = set()
+        null_embedding: set[str] = set()
+        # Bound the IN-list so a large chunk can't build an over-long
+        # PostgREST URL. The refresh upserts ~30 at a time, so this is
+        # normally a single pass.
+        for start in range(0, len(ids), 200):
+            batch = ids[start : start + 200]
+            existing_response = (
+                client.table(self._table)
+                .select("job_id")
+                .eq("source", source)
+                .in_("job_id", batch)
+                .execute()
+            )
+            for row in self._extract_rows(existing_response):
+                jid = str(row.get("job_id") or "").strip()
+                if jid:
+                    existing.add(jid)
+            null_response = (
+                client.table(self._table)
+                .select("job_id")
+                .eq("source", source)
+                .in_("job_id", batch)
+                .is_("embedding", "null")
+                .execute()
+            )
+            for row in self._extract_rows(null_response):
+                jid = str(row.get("job_id") or "").strip()
+                if jid:
+                    null_embedding.add(jid)
+        return existing - null_embedding
+
     def cleanup_missing(
         self,
         *,
diff --git a/tests/test_cached_jobs_store.py b/tests/test_cached_jobs_store.py