docs: record Day 82 — free downgrade, lean mode executed, cron-log purge

LEANDERANTONY · LEANDERANTONY · commit c450d228e9b1 · 2026-06-23T19:27:51.000+05:30
DEVLOG Day 82: the org was downgraded Supabase Pro -> free, lean mode was
executed (470 -> 171 MB, embeddings + HNSW + hybrid RPC dropped,
JOB_SEARCH_HYBRID_ENABLED=false, search now lexical-only, all 14,083 jobs
intact), the landing copy was corrected to drop the now-false
semantic/hybrid claims (commit 470e62f), and a daily
purge-cron-run-history pg_cron job was added (cleared 29,156 stale rows).
Also documents the corrected lesson: a plain VACUUM is NOT a durable lever
here (the churn re-bloats ~90 MB/day back to the ~485 MB high-water-mark,
not 30-60 MB/month as Day 81 wrongly claimed).

deployment.md:
- scheduled-jobs inventory: add purge-cron-run-history; note
  embed-on-write is OFF under lean mode.
- hybrid-search lean/full runbook: flipped from "on Pro, lean is
  hypothetical" to "CURRENT STATE: on free, lean ACTIVE"; the full/hybrid
  half is now the restore-on-Pro path. Corrected the wrong "VACUUM buys
  30-60 MB/month" claim.

architecture.md: note that prod runs lean (embedding column + HNSW +
hybrid RPC dropped, flag false, lexical-only search) as of 2026-06-23; the
hybrid capability remains in code as a restore-on-Pro operation.

(AGENT.md §3 + §6 updated locally too: free plan, lean active, ~171 MB,
the new cron job; untracked working briefing, not in this commit.)
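
The figures in this message can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical (they do not exist in the repo), the inputs are the sizes quoted above, and the 30-day month is an assumption used solely to compare the observed daily regrowth against the Day 81 monthly estimate.

```python
"""Sanity-check the storage and cron figures quoted in this commit.

All names here are hypothetical helpers for illustration; only the
numeric inputs come from the commit message and DEVLOG Day 82.
"""

FREE_CAP_MB = 500        # Supabase free-tier database cap
AFTER_VACUUM_MB = 381    # DB size right after the Day 81 VACUUM
ONE_DAY_LATER_MB = 470   # DB size one day later, before lean mode
AFTER_LEAN_MB = 171      # DB size after lean mode + VACUUM FULL
DAYS_PER_MONTH = 30      # assumption, used only to compare rates


def rebloat_mb_per_day() -> int:
    """Observed regrowth over the single day between the two measurements."""
    return ONE_DAY_LATER_MB - AFTER_VACUUM_MB


def estimate_error_factor(claimed_mb_per_month: float) -> float:
    """How many times faster the observed regrowth is than a claimed monthly rate."""
    return rebloat_mb_per_day() * DAYS_PER_MONTH / claimed_mb_per_month


def headroom_mb() -> int:
    """Space left under the free-tier cap after lean mode."""
    return FREE_CAP_MB - AFTER_LEAN_MB


def cron_rows_retained(interval_minutes: int, retention_days: int) -> int:
    """Run-history rows a single job keeps under a fixed retention window."""
    runs_per_day = 24 * 60 // interval_minutes
    return runs_per_day * retention_days


if __name__ == "__main__":
    print("re-bloat MB/day:", rebloat_mb_per_day())
    print("error vs 60 MB/month:", estimate_error_factor(60))
    print("error vs 30 MB/month:", estimate_error_factor(30))
    print("headroom MB:", headroom_mb())
    print("5-min job rows kept over 7 days:", cron_rows_retained(5, 7))
```

Under these inputs the observed regrowth is 45 to 89 times the Day 81 estimate, and the 7-day retention window leaves the 5-minute cleanup job holding about two thousand run-history rows rather than tens of thousands.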
diff --git a/docs/DEVLOG.md b/docs/DEVLOG.md
@@ -4232,3 +4232,53 @@ to click — the docs leave it downgrade-ready.
 No application code changed this day — the flag wiring was already in
 `src/cached_jobs_store.py` + `src/config.py`; everything else is SQL +
 docs + one maintenance command.
+
+## Day 82: downgraded to free + executed lean mode + cron-log purge
+
+The Day 81 framing ("on Pro, 504 MB is fine, lean mode is hypothetical")
+flipped to reality: the org was downgraded **Pro → free** to cut the
+~$25/mo cost while pre-revenue, and lean mode was executed for real.
+
+**The VACUUM-isn't-durable lesson.** The Day 81 `VACUUM FULL` dipped the DB
+504 → 381 MB, but within ONE day the 4-hourly refresh churn refilled it
+to **470 MB** (row count flat at 14k, dead_tuples 0 — the file
+high-water-mark just re-grew, ~90 MB/day back toward its ~485 MB natural
+size). So the Day 81 estimate of "2-4 months runway / ~30-60 MB per
+month" was off by well over an order of magnitude. On a table that churns
+6×/day with embeddings, a standalone `VACUUM FULL` buys ~2 days, not months. Corrected the
+deployment.md runbook accordingly.
+
+**Lean mode, executed (jobagent prod):** flipped
+`JOB_SEARCH_HYBRID_ENABLED=false` in the VPS `.env` + recreated the api
+container (embed-on-write off, search degrades to the lexical
+`search_cached_jobs_ranked` RPC) → applied migration
+`cached_jobs_lean_mode_drop_semantic_layer` (drop HNSW index + hybrid
+RPC + `embedding` column) → `VACUUM FULL`. Result: **470 → 171 MB**
+(`cached_jobs` 451 → 151 MB), all 14,083 jobs intact, **~330 MB under the
+500 MB cap**, live lexical search verified returning real results. This
+is durable where the VACUUM wasn't: no embed-on-write churn + no HNSW
+bloat means the DB stays ~150-200 MB instead of re-climbing. Reversible
+back to semantic on a future Pro upgrade (pgvector.sql + backfill (~2¢) +
+hybrid.sql + flag on).
+
+**Landing copy corrected (commit `470e62f`).** Search is lexical-only
+now, but the landing still advertised "Semantic job search" / "Hybrid
+lexical + semantic search understands what you mean" — false at launch.
+Swapped to accurate copy for the real Tier 1 feature: the hero pill is
+now "Synonym-aware search" and the bento describes the genuine
+synonym/abbreviation expansion ("SWE" finds "software engineer", "PM"
+finds "product manager", via `src/job_search_synonyms.py`). tsc + eslint
+clean.
+
+**New cron `purge-cron-run-history` (daily `30 3 * * *`).** pg_cron's
+own `cron.job_run_details` run-history table grows unboundedly — the
+5-min resume-builder cleanup alone adds ~288 rows/day — and had reached
+~30k stale rows (5.6 MB). Added a daily `DELETE ... WHERE end_time <
+now() - interval '7 days'` pg_cron job to bound it; a one-time backfill
+purge cleared **29,156** rows. Recorded in the deployment.md scheduled-
+jobs inventory + AGENT.md §6.
+
+Net: jobagent now launches on Supabase free (500 MB cap, NO automated
+backups, 7-day idle auto-pause — all three are real launch risks, see
+AGENT.md §3), at 171 MB, lexical search, $0 Supabase. Revisit Pro +
+semantic search once there's revenue.
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -217,7 +217,7 @@ Each `saved_jobs` row stores one shortlisted posting per user and normalized job
 
 Each `resume_builder_sessions` row stores one in-progress conversational resume-builder draft per user with a 7-day TTL refreshed on every save. A `pg_cron` job (`cleanup-expired-resume-builder-sessions`) hard-deletes expired rows every 5 min and RLS hides expired rows from per-user queries; see [ADR-016](adr/ADR-016-conversational-llm-resume-builder.md).
 
-Each `cached_jobs` row holds one upstream posting keyed on `(source, job_id)`. The table has GENERATED STORED columns (`work_mode`, `employment_type_norm`) backing the dropdown filters, `removed_at` tombstones for upstream-closed jobs the user has bookmarked, and an `embedding vector(1536)` column (pgvector, HNSW cosine index) for semantic search. A `pg_cron` + `pg_net` schedule (`cached_jobs_refresh_4h`) POSTs to `/admin/refresh-cache` every 4 hours, six times a day (see `docs/sql/job_cache_cron_setup.sql` for the template — production runs `0 */4 * * *`). Search is two-tier: the lexical `search_cached_jobs_ranked` RPC ([ADR-014](adr/ADR-014-postgres-rpc-for-ranked-search.md)) and the hybrid `search_cached_jobs_hybrid` RPC, which fuses that lexical ranking with a pgvector semantic ranking via Reciprocal Rank Fusion. The hybrid path is gated behind the `JOB_SEARCH_HYBRID_ENABLED` flag and degrades to lexical on any failure; see [ADR-033](adr/ADR-033-hybrid-job-search.md).
+Each `cached_jobs` row holds one upstream posting keyed on `(source, job_id)`. The table has GENERATED STORED columns (`work_mode`, `employment_type_norm`) backing the dropdown filters, `removed_at` tombstones for upstream-closed jobs the user has bookmarked, and an `embedding vector(1536)` column (pgvector, HNSW cosine index) for semantic search. A `pg_cron` + `pg_net` schedule (`cached_jobs_refresh_4h`) POSTs to `/admin/refresh-cache` every 4 hours, six times a day (see `docs/sql/job_cache_cron_setup.sql` for the template — production runs `0 */4 * * *`). Search is two-tier: the lexical `search_cached_jobs_ranked` RPC ([ADR-014](adr/ADR-014-postgres-rpc-for-ranked-search.md)) and the hybrid `search_cached_jobs_hybrid` RPC, which fuses that lexical ranking with a pgvector semantic ranking via Reciprocal Rank Fusion. The hybrid path is gated behind the `JOB_SEARCH_HYBRID_ENABLED` flag and degrades to lexical on any failure; see [ADR-033](adr/ADR-033-hybrid-job-search.md). **As of 2026-06-23, production runs in LEAN MODE (Supabase free-tier downgrade): the `embedding` column + HNSW index + hybrid RPC are DROPPED in prod and the flag is `false`, so live search is lexical-only. The hybrid capability remains in the codebase + the SQL files; it's a restore-on-Pro operation, not a deleted feature. See the "Hybrid-search lean/full switch" runbook in [deployment.md](deployment.md) and DEVLOG Day 82.**
 
 `aijobagent_run_traces` is an append-only cost-attribution table — one row per successful LLM call (`user_id`, `model`, `task`, `prompt_tokens`, `completion_tokens`, `cost_usd`, `created_at`). Writes are best-effort: a missing table or a write error never propagates to the user-facing path. It is the canonical answer to "what is OpenAI spend doing", separate from the Sentry/PostHog telemetry surface.
 
diff --git a/docs/deployment.md b/docs/deployment.md
@@ -9,13 +9,14 @@ into `architecture.md` or an ADR lives here.
 
 This is the authoritative list of everything that runs on a schedule
 in production. **No scheduled job runs a chat / completion model.**
-The cache refresh does spend a small amount on *embeddings* — see the
-`cached_jobs_refresh` row. Audited 2026-05-22.
+Audited 2026-05-22; updated 2026-06-23 (lean-mode downgrade + the new
+`purge-cron-run-history` job).
 
 | Job | Where | Schedule | What it does | LLM cost |
 | --- | --- | --- | --- | --- |
-| `cached_jobs_refresh` | Supabase `pg_cron` + `pg_net` | every 4h (`0 */4 * * *`, six runs/day) | POSTs `/api/admin/refresh-cache` so the backend re-polls the Greenhouse/Lever/Ashby/Workday boards into `cached_jobs` | **~\$0** — job-board APIs are free; embeds only *newly-cached* jobs with `text-embedding-3-small` when `JOB_SEARCH_HYBRID_ENABLED` is on (a few cents/day). No chat-model call |
+| `cached_jobs_refresh` | Supabase `pg_cron` + `pg_net` | every 4h (`0 */4 * * *`, six runs/day) | POSTs `/api/admin/refresh-cache` so the backend re-polls the Greenhouse/Lever/Ashby/Workday boards into `cached_jobs` | **\$0** now — job-board APIs are free, and **embed-on-write is OFF since the 2026-06-23 lean-mode downgrade** (`JOB_SEARCH_HYBRID_ENABLED=false`, embedding column dropped). When the project is back on Pro + hybrid is re-enabled, this row embeds newly-cached jobs with `text-embedding-3-small` (a few ¢/day). No chat-model call either way |
 | `cleanup-expired-resume-builder-sessions` | Supabase `pg_cron` | every 5 min (`*/5 * * * *`) | Hard-deletes `resume_builder_sessions` rows past their 7-day TTL | **\$0** — plain SQL `DELETE`, no LLM |
+| `purge-cron-run-history` | Supabase `pg_cron` | daily (`30 3 * * *`) | `DELETE FROM cron.job_run_details WHERE end_time < now() - interval '7 days'`. Bounds pg_cron's own run-history table, which grows unboundedly (the 5-min resume-builder cleanup alone adds ~288 rows/day). Added 2026-06-23 after it had reached ~30k stale rows | **\$0** — plain SQL `DELETE`, no LLM |
 | `backend.maintenance` (saved-workspace retention sweep) | VPS host crontab — **verify against prod** | daily (`0 3 * * *` assumed) | Runs `python -m backend.maintenance` → `sweep_expired_workspaces`: deletes saved workspaces past their per-tier retention (Free > 7d, Pro > 30d). Now check-ins to the `saved-workspaces-retention` Sentry cron monitor (M22) | **\$0** — plain `DELETE`, no LLM |
 | `backend.nightly_eval` | VPS host crontab | **NOT INSTALLED** | Would run the LLM quality eval; deliberately not scheduled | would be ~\$0.25/run if enabled |
 
@@ -273,36 +274,38 @@ botched drop can't lose the exact index config. The lean-mode script
 only ever drops the *semantic* add-ons (embedding column + HNSW); it
 never touches the base table or the lexical indexes.
 
-**IMPORTANT — the 500 MB cap is NOT a current constraint.** The
-jobagent Supabase org (`Job_Application_Copilot`) is on the **Pro plan
-(8 GB database, daily backups, no idle auto-pause)**. So none of the
-"over the cap" framing below is a live emergency — it all describes the
-*future* scenario where the org is downgraded **Pro → free** to save the
-~$25/mo (no users yet). The lean/full switch + the VACUUM lever exist to
-make that downgrade possible and survivable, not to rescue a current
-overage.
-
-**Live storage check + VACUUM (jobagent prod, 2026-06-23):** the DB was
-**504 MB** (fine on Pro's 8 GB, but over the *free*-tier 500 MB number).
-`cached_jobs` was 485 MB of that (14,085 rows, all embedded): heap+TOAST
-335 MB, indexes 150 MB; the semantic layer alone = the `embedding`
-column (83 MB live) + the HNSW index (110 MB) = **193 MB**. The
-heap+TOAST (335 MB) far exceeded the live column data (~220 MB) — ~115
-MB of dead-tuple bloat from the 4-hourly refresh churn (autovacuum
-reclaims-for-reuse but never shrinks the file). A **`VACUUM FULL
-public.cached_jobs` was run** to reclaim it: **504 → 381 MB** (−123 MB;
-heap+TOAST 335→241, indexes 150→121). No data lost, semantic search
-intact. This made the project **free-downgrade-eligible** (free limits
-all clear: DB 381/500 MB, storage 0/1 GB, MAU 0/50k, 1 project/org).
-
-Two levers, two purposes:
-- **Standalone `VACUUM FULL`** (what was run) — reclaims churn bloat,
-  keeps semantic search. ~2-min ACCESS EXCLUSIVE lock. The churn
-  re-accumulates ~30-60 MB/month, so it's a periodic reset (re-run
-  when the DB nears ~480 MB), not a permanent fix.
-- **Lean mode** (the switch above) — drops embeddings + HNSW to ~300 MB
-  durably; loses semantic concept-matching. The fallback for staying
-  under 500 MB long-term without re-VACUUMing or paying for Pro.
+**CURRENT STATE (2026-06-23): on the FREE plan, LEAN MODE ACTIVE.** The
+jobagent Supabase org was downgraded **Pro → free** to cut the ~$25/mo
+cost while pre-revenue, and lean mode was executed to fit the 500 MB cap. So job
+search is **lexical-only right now** — the "full / hybrid" half below is
+the *restore* path for if/when the project goes back to Pro, not the
+live config.
+
+**What happened (chronological, jobagent prod 2026-06-23):**
+1. Live check found the DB at **470 MB** and climbing: a `VACUUM FULL` the
+   day before had dipped it to 381 MB, but the 4-hourly refresh churn
+   refilled it within a day. **A standalone `VACUUM FULL` is NOT a durable
+   lever on this table**; it re-bloats to its ~485 MB natural high-water-mark in days.
+2. Downgraded the org to free.
+3. Executed lean mode: `JOB_SEARCH_HYBRID_ENABLED=false` + redeployed api
+   (embed-on-write off, search → lexical) → dropped the HNSW index +
+   hybrid RPC + `embedding` column (migration
+   `cached_jobs_lean_mode_drop_semantic_layer`) → `VACUUM FULL`.
+4. Result: **470 → 171 MB** (cached_jobs 451 → 151 MB), all 14,083 jobs
+   intact, lexical `search_cached_jobs_ranked` serving live results.
+   **~330 MB under the cap.**
+
+**Why lean mode holds where VACUUM didn't:** with embed-on-write off and
+no HNSW index, the refresh now only touches cheap text columns — there's
+no per-row 1536-dim vector churn and no HNSW graph to bloat. The DB sits
+at ~150-200 MB and *stays* there. That is the whole point of lean vs a
+VACUUM (which the churn undoes in days).
+
+Also corrected: the prior version of this runbook claimed the churn
+re-accumulates "~30-60 MB/month" and that a periodic VACUUM is a lever.
+**Both were wrong** — re-bloat is ~90 MB/*day* back to the ~485 MB
+high-water-mark, so VACUUM buys ~2 days, not months. Lean mode (or Pro)
+is the only durable answer.
 
 **Downgrade-at-launch caveat:** free tier (a) **auto-pauses a project
 after 7 days of inactivity** — a launched-but-quiet app sleeps and the