Commit c450d22

docs: record Day 82 — free downgrade, lean mode executed, cron-log purge
DEVLOG Day 82: the org was downgraded Supabase Pro -> free, lean mode was executed (470 -> 171 MB, embeddings + HNSW + hybrid RPC dropped, JOB_SEARCH_HYBRID_ENABLED=false, search now lexical-only, all 14,083 jobs intact), the landing copy was corrected to drop the now-false semantic/hybrid claims (commit 470e62f), and a daily purge-cron-run-history pg_cron job was added (cleared 29,156 stale rows). Also documents the corrected lesson: a plain VACUUM is NOT a durable lever here (the churn re-bloats ~90 MB/day back to the ~485 MB high-water-mark, not 30-60 MB/month as Day 81 wrongly claimed).

deployment.md:
- scheduled-jobs inventory: add purge-cron-run-history; note embed-on-write is OFF under lean mode.
- hybrid-search lean/full runbook: flipped from "on Pro, lean is hypothetical" to "CURRENT STATE: on free, lean ACTIVE"; the full/hybrid half is now the restore-on-Pro path. Corrected the wrong "VACUUM buys 30-60 MB/month" claim.

architecture.md: note that prod runs lean (embedding column + HNSW + hybrid RPC dropped, flag false, lexical-only search) as of 2026-06-23; the hybrid capability remains in code as a restore-on-Pro operation.

(AGENT.md §3 + §6 updated locally too — free plan, lean active, ~171 MB, the new cron job; untracked working briefing, not in this commit.)
1 parent 470e62f commit c450d22

3 files changed

Lines changed: 87 additions & 34 deletions

docs/DEVLOG.md

Lines changed: 50 additions & 0 deletions
@@ -4232,3 +4232,53 @@ to click — the docs leave it downgrade-ready.
 No application code changed this day — the flag wiring was already in
 `src/cached_jobs_store.py` + `src/config.py`; everything else is SQL +
 docs + one maintenance command.
+
+## Day 82: downgraded to free + executed lean mode + cron-log purge
+
+The Day 81 framing ("on Pro, 504 MB is fine, lean mode is hypothetical")
+flipped to reality: the org was downgraded **Pro → free** to cut the
+~$25/mo pre-revenue, and lean mode was executed for real.
+
+**The VACUUM-isn't-durable lesson.** The Day 81 VACUUM dipped the DB
+504 → 381 MB, but within ONE day the 4-hourly refresh churn refilled it
+to **470 MB** (row count flat at 14k, dead_tuples 0 — the file
+high-water-mark just re-grew, ~90 MB/day back toward its ~485 MB natural
+size). So the Day 81 estimate of "2-4 months runway / ~30-60 MB per
+month" was wrong by an order of magnitude. On a table that churns 6×/day
+with embeddings, a plain VACUUM buys ~2 days, not months. Corrected the
+deployment.md runbook accordingly.
+
+**Lean mode, executed (jobagent prod):** flipped
+`JOB_SEARCH_HYBRID_ENABLED=false` in the VPS `.env` + recreated the api
+container (embed-on-write off, search degrades to the lexical
+`search_cached_jobs_ranked` RPC) → applied migration
+`cached_jobs_lean_mode_drop_semantic_layer` (drop HNSW index + hybrid
+RPC + `embedding` column) → `VACUUM FULL`. Result: **470 → 171 MB**
+(`cached_jobs` 451 → 151 MB), all 14,083 jobs intact, **330 MB under the
+500 MB cap**, live lexical search verified returning real results. This
+is durable where the VACUUM wasn't: no embed-on-write churn + no HNSW
+bloat means the DB stays ~150-200 MB instead of re-climbing. Reversible
+back to semantic on a future Pro upgrade (pgvector.sql + backfill (~2¢) +
+hybrid.sql + flag on).
+
+**Landing copy corrected (commit `470e62f`).** Search is lexical-only
+now, but the landing still advertised "Semantic job search" / "Hybrid
+lexical + semantic search understands what you mean" — false at launch.
+Swapped to accurate copy for the real Tier 1 feature: the hero pill is
+now "Synonym-aware search" and the bento describes the genuine
+synonym/abbreviation expansion ("SWE" finds "software engineer", "PM"
+finds "product manager", via `src/job_search_synonyms.py`). tsc + eslint
+clean.
+
+**New cron `purge-cron-run-history` (daily `30 3 * * *`).** pg_cron's
+own `cron.job_run_details` run-history table grows unboundedly — the
+5-min resume-builder cleanup alone adds ~288 rows/day — and had reached
+~30k stale rows (5.6 MB). Added a daily `DELETE ... WHERE end_time <
+now() - interval '7 days'` pg_cron job to bound it; a one-time backfill
+purge cleared **29,156** rows. Recorded in the deployment.md scheduled-
+jobs inventory + AGENT.md §6.
+
+Net: jobagent now launches on Supabase free (500 MB cap, NO automated
+backups, 7-day idle auto-pause — all three are real launch risks, see
+AGENT.md §3), at 171 MB, lexical search, $0 Supabase. Revisit Pro +
+semantic search once there's revenue.
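The runway correction in the Day 82 entry is plain arithmetic. A minimal Python sketch follows, using only figures quoted in the entry (381 MB after the Day 81 VACUUM, 470 MB one day later, a ~485 MB high-water-mark, the 500 MB free cap, 171 MB after lean mode). The helper name `days_until`, the linear-growth model, and the 30-day month are illustrative assumptions, not code from the repo.

```python
# Illustrative arithmetic only; `days_until` is a hypothetical helper.
# Assumption: re-bloat is linear, which overstates growth close to the
# high-water-mark, so treat the results as rough orders of magnitude.


def days_until(target_mb: float, start_mb: float, growth_mb_per_day: float) -> float:
    """Days for the DB to grow linearly from start_mb to target_mb."""
    if growth_mb_per_day <= 0:
        raise ValueError("growth_mb_per_day must be positive")
    return max(0.0, (target_mb - start_mb) / growth_mb_per_day)


POST_VACUUM_MB = 381   # after the Day 81 VACUUM
NEXT_DAY_MB = 470      # one day later
HIGH_WATER_MB = 485    # "natural" size quoted in the entry
FREE_CAP_MB = 500      # Supabase free-tier database cap
LEAN_MB = 171          # after lean mode + VACUUM FULL

# Observed re-bloat rate: MB regained in a single day.
observed_rate = NEXT_DAY_MB - POST_VACUUM_MB

# Runway back to the high-water-mark at the observed rate.
observed_days = days_until(HIGH_WATER_MB, POST_VACUUM_MB, observed_rate)

# Runway implied by the Day 81 estimate of 30-60 MB/month
# (assumption: a 30-day month).
day81_days_fast = days_until(HIGH_WATER_MB, POST_VACUUM_MB, 60 / 30)
day81_days_slow = days_until(HIGH_WATER_MB, POST_VACUUM_MB, 30 / 30)

# Headroom under the free cap once lean mode is active.
lean_headroom_mb = FREE_CAP_MB - LEAN_MB

if __name__ == "__main__":
    print(f"observed re-bloat: {observed_rate} MB/day")
    print(f"runway at observed rate: {observed_days:.2f} days")
    print(f"runway at Day 81 estimate: {day81_days_fast:.0f}-{day81_days_slow:.0f} days")
    print(f"lean-mode headroom: {lean_headroom_mb} MB")
```

Under these assumptions the observed rate exhausts the post-VACUUM slack in roughly a day, whereas the Day 81 rate would have taken months, which is the gap the entry corrects.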

docs/architecture.md

Lines changed: 1 addition & 1 deletion
@@ -217,7 +217,7 @@ Each `saved_jobs` row stores one shortlisted posting per user and normalized job
 
 Each `resume_builder_sessions` row stores one in-progress conversational resume-builder draft per user with a 7-day TTL refreshed on every save. A `pg_cron` job (`cleanup-expired-resume-builder-sessions`) hard-deletes expired rows every 5 min and RLS hides expired rows from per-user queries; see [ADR-016](adr/ADR-016-conversational-llm-resume-builder.md).
 
-Each `cached_jobs` row holds one upstream posting keyed on `(source, job_id)`. The table has GENERATED STORED columns (`work_mode`, `employment_type_norm`) backing the dropdown filters, `removed_at` tombstones for upstream-closed jobs the user has bookmarked, and an `embedding vector(1536)` column (pgvector, HNSW cosine index) for semantic search. A `pg_cron` + `pg_net` schedule (`cached_jobs_refresh_4h`) POSTs to `/admin/refresh-cache` every 4 hours, six times a day (see `docs/sql/job_cache_cron_setup.sql` for the template — production runs `0 */4 * * *`). Search is two-tier: the lexical `search_cached_jobs_ranked` RPC ([ADR-014](adr/ADR-014-postgres-rpc-for-ranked-search.md)) and the hybrid `search_cached_jobs_hybrid` RPC, which fuses that lexical ranking with a pgvector semantic ranking via Reciprocal Rank Fusion. The hybrid path is gated behind the `JOB_SEARCH_HYBRID_ENABLED` flag and degrades to lexical on any failure; see [ADR-033](adr/ADR-033-hybrid-job-search.md).
+Each `cached_jobs` row holds one upstream posting keyed on `(source, job_id)`. The table has GENERATED STORED columns (`work_mode`, `employment_type_norm`) backing the dropdown filters, `removed_at` tombstones for upstream-closed jobs the user has bookmarked, and an `embedding vector(1536)` column (pgvector, HNSW cosine index) for semantic search. A `pg_cron` + `pg_net` schedule (`cached_jobs_refresh_4h`) POSTs to `/admin/refresh-cache` every 4 hours, six times a day (see `docs/sql/job_cache_cron_setup.sql` for the template — production runs `0 */4 * * *`). Search is two-tier: the lexical `search_cached_jobs_ranked` RPC ([ADR-014](adr/ADR-014-postgres-rpc-for-ranked-search.md)) and the hybrid `search_cached_jobs_hybrid` RPC, which fuses that lexical ranking with a pgvector semantic ranking via Reciprocal Rank Fusion. The hybrid path is gated behind the `JOB_SEARCH_HYBRID_ENABLED` flag and degrades to lexical on any failure; see [ADR-033](adr/ADR-033-hybrid-job-search.md). **As of 2026-06-23, production runs in LEAN MODE (Supabase free-tier downgrade): the `embedding` column + HNSW index + hybrid RPC are DROPPED in prod and the flag is `false`, so live search is lexical-only. The hybrid capability remains in the codebase + the SQL files; it's a restore-on-Pro operation, not a deleted feature. See the "Hybrid-search lean/full switch" runbook in [deployment.md](deployment.md) and DEVLOG Day 82.**
 
 `aijobagent_run_traces` is an append-only cost-attribution table — one row per successful LLM call (`user_id`, `model`, `task`, `prompt_tokens`, `completion_tokens`, `cost_usd`, `created_at`). Writes are best-effort: a missing table or a write error never propagates to the user-facing path. It is the canonical answer to "what is OpenAI spend doing", separate from the Sentry/PostHog telemetry surface.
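The two behaviours the `cached_jobs` paragraph describes, a flag-gated hybrid path that degrades to lexical on any failure and Reciprocal Rank Fusion of two rankings, can be sketched in a few lines of Python. This is a hedged illustration, not the project's implementation: in the real system the fusion lives inside the `search_cached_jobs_hybrid` SQL RPC, and the names `rrf_fuse` and `search_jobs`, the callable signatures, and the constant `k = 60` are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical signature: a search backend maps a query to ranked job IDs.
SearchFn = Callable[[str], List[str]]


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists with Reciprocal Rank Fusion.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is an assumed value; the source does
    not state which constant the RPC uses.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, job_id in enumerate(ranking, start=1):
            scores[job_id] = scores.get(job_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; break ties by ID so the order is deterministic.
    return sorted(scores, key=lambda job_id: (-scores[job_id], job_id))


def search_jobs(
    query: str,
    *,
    hybrid_enabled: bool,
    lexical: SearchFn,
    hybrid: SearchFn,
) -> List[str]:
    """Use the hybrid backend only when the flag is on; fall back to
    lexical if the flag is off or the hybrid call raises."""
    if hybrid_enabled:
        try:
            return hybrid(query)
        except Exception:
            # Any hybrid failure degrades to the lexical path.
            pass
    return lexical(query)


if __name__ == "__main__":
    lexical_ranking = ["a", "b", "c"]
    semantic_ranking = ["b", "c", "d"]
    print(rrf_fuse([lexical_ranking, semantic_ranking]))
```

With the flag set to `false`, as in lean mode, `search_jobs` never calls the hybrid backend at all, so dropping the hybrid RPC in the database cannot surface as a search error under this sketch.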

docs/deployment.md

Lines changed: 36 additions & 33 deletions
@@ -9,13 +9,14 @@ into `architecture.md` or an ADR lives here.
 
 This is the authoritative list of everything that runs on a schedule
 in production. **No scheduled job runs a chat / completion model.**
-The cache refresh does spend a small amount on *embeddings* — see the
-`cached_jobs_refresh` row. Audited 2026-05-22.
+Audited 2026-05-22; updated 2026-06-23 (lean-mode downgrade + the new
+`purge-cron-run-history` job).
 
 | Job | Where | Schedule | What it does | LLM cost |
 | --- | --- | --- | --- | --- |
-| `cached_jobs_refresh` | Supabase `pg_cron` + `pg_net` | every 4h (`0 */4 * * *`, six runs/day) | POSTs `/api/admin/refresh-cache` so the backend re-polls the Greenhouse/Lever/Ashby/Workday boards into `cached_jobs` | **~\$0** — job-board APIs are free; embeds only *newly-cached* jobs with `text-embedding-3-small` when `JOB_SEARCH_HYBRID_ENABLED` is on (a few cents/day). No chat-model call |
+| `cached_jobs_refresh` | Supabase `pg_cron` + `pg_net` | every 4h (`0 */4 * * *`, six runs/day) | POSTs `/api/admin/refresh-cache` so the backend re-polls the Greenhouse/Lever/Ashby/Workday boards into `cached_jobs` | **\$0** now — job-board APIs are free, and **embed-on-write is OFF since the 2026-06-23 lean-mode downgrade** (`JOB_SEARCH_HYBRID_ENABLED=false`, embedding column dropped). When the project is back on Pro + hybrid is re-enabled, this row embeds newly-cached jobs with `text-embedding-3-small` (a few ¢/day). No chat-model call either way |
 | `cleanup-expired-resume-builder-sessions` | Supabase `pg_cron` | every 5 min (`*/5 * * * *`) | Hard-deletes `resume_builder_sessions` rows past their 7-day TTL | **\$0** — plain SQL `DELETE`, no LLM |
+| `purge-cron-run-history` | Supabase `pg_cron` | daily (`30 3 * * *`) | `DELETE FROM cron.job_run_details WHERE end_time < now() - interval '7 days'`. Bounds pg_cron's own run-history table, which grows unboundedly (the 5-min resume-builder cleanup alone adds ~288 rows/day). Added 2026-06-23 after it had reached ~30k stale rows | **\$0** — plain SQL `DELETE`, no LLM |
 | `backend.maintenance` (saved-workspace retention sweep) | VPS host crontab — **verify against prod** | daily (`0 3 * * *` assumed) | Runs `python -m backend.maintenance` → `sweep_expired_workspaces`: deletes saved workspaces past their per-tier retention (Free > 7d, Pro > 30d). Now check-ins to the `saved-workspaces-retention` Sentry cron monitor (M22) | **\$0** — plain `DELETE`, no LLM |
 | `backend.nightly_eval` | VPS host crontab | **NOT INSTALLED** | Would run the LLM quality eval; deliberately not scheduled | would be ~\$0.25/run if enabled |
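The row-count figures in the inventory follow directly from the cron expressions. A small Python sketch, assuming a deliberately tiny cron subset (`*`, `*/n`, or a single integer in the minute and hour fields, and `*` everywhere else), reproduces them and estimates the steady-state size of `cron.job_run_details` under the 7-day retention. The function names are hypothetical, and the assumption that only the three `pg_cron` jobs in the table write run history is taken from the inventory, not verified against prod.

```python
# Hypothetical helpers; this handles only the cron syntax used in the table.


def field_matches(field: str, lo: int, hi: int) -> int:
    """Count the values in [lo, hi] matched by '*', '*/n', or one integer."""
    if field == "*":
        return hi - lo + 1
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError(f"invalid step in {field!r}")
        return len(range(lo, hi + 1, step))
    value = int(field)
    if not lo <= value <= hi:
        raise ValueError(f"{value} outside {lo}-{hi}")
    return 1


def runs_per_day(expr: str) -> int:
    """Daily run count for a five-field cron expression that fires every day."""
    minute, hour, day_of_month, month, day_of_week = expr.split()
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("only daily schedules are supported in this sketch")
    return field_matches(minute, 0, 59) * field_matches(hour, 0, 23)


# Schedules copied from the inventory table (pg_cron jobs only).
PG_CRON_JOBS = {
    "cached_jobs_refresh": "0 */4 * * *",
    "cleanup-expired-resume-builder-sessions": "*/5 * * * *",
    "purge-cron-run-history": "30 3 * * *",
}

RETENTION_DAYS = 7  # matches the purge job's `interval '7 days'`

rows_per_day = sum(runs_per_day(expr) for expr in PG_CRON_JOBS.values())
steady_state_rows = RETENTION_DAYS * rows_per_day

if __name__ == "__main__":
    for name, expr in PG_CRON_JOBS.items():
        print(f"{name}: {runs_per_day(expr)} runs/day")
    print(f"run-history rows retained at steady state: ~{steady_state_rows}")
```

Under these assumptions the retained history settles at about two thousand rows, compared with the ~30k that had accumulated before the purge job existed.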

@@ -273,36 +274,38 @@ botched drop can't lose the exact index config. The lean-mode script
 only ever drops the *semantic* add-ons (embedding column + HNSW); it
 never touches the base table or the lexical indexes.
 
-**IMPORTANT — the 500 MB cap is NOT a current constraint.** The
-jobagent Supabase org (`Job_Application_Copilot`) is on the **Pro plan
-(8 GB database, daily backups, no idle auto-pause)**. So none of the
-"over the cap" framing below is a live emergency — it all describes the
-*future* scenario where the org is downgraded **Pro → free** to save the
-~$25/mo (no users yet). The lean/full switch + the VACUUM lever exist to
-make that downgrade possible and survivable, not to rescue a current
-overage.
-
-**Live storage check + VACUUM (jobagent prod, 2026-06-23):** the DB was
-**504 MB** (fine on Pro's 8 GB, but over the *free*-tier 500 MB number).
-`cached_jobs` was 485 MB of that (14,085 rows, all embedded): heap+TOAST
-335 MB, indexes 150 MB; the semantic layer alone = the `embedding`
-column (83 MB live) + the HNSW index (110 MB) = **193 MB**. The
-heap+TOAST (335 MB) far exceeded the live column data (~220 MB) — ~115
-MB of dead-tuple bloat from the 4-hourly refresh churn (autovacuum
-reclaims-for-reuse but never shrinks the file). A **`VACUUM FULL
-public.cached_jobs` was run** to reclaim it: **504 → 381 MB** (−123 MB;
-heap+TOAST 335→241, indexes 150→121). No data lost, semantic search
-intact. This made the project **free-downgrade-eligible** (free limits
-all clear: DB 381/500 MB, storage 0/1 GB, MAU 0/50k, 1 project/org).
-
-Two levers, two purposes:
-- **Standalone `VACUUM FULL`** (what was run) — reclaims churn bloat,
-keeps semantic search. ~2-min ACCESS EXCLUSIVE lock. The churn
-re-accumulates ~30-60 MB/month, so it's a periodic reset (re-run
-when the DB nears ~480 MB), not a permanent fix.
-- **Lean mode** (the switch above) — drops embeddings + HNSW to ~300 MB
-durably; loses semantic concept-matching. The fallback for staying
-under 500 MB long-term without re-VACUUMing or paying for Pro.
+**CURRENT STATE (2026-06-23): on the FREE plan, LEAN MODE ACTIVE.** The
+jobagent Supabase org was downgraded **Pro → free** to cut the ~$25/mo
+pre-revenue, and lean mode was executed to fit the 500 MB cap. So job
+search is **lexical-only right now** — the "full / hybrid" half below is
+the *restore* path for if/when the project goes back to Pro, not the
+live config.
+
+**What happened (chronological, jobagent prod 2026-06-23):**
+1. Live check found the DB at **470 MB** (climbing — a VACUUM the day
+before had dipped it to 381 MB but the 4-hourly refresh churn refilled
+it within a day; **a plain VACUUM is NOT a durable lever on this
+table** — it re-bloats to its ~485 MB natural high-water-mark in days).
+2. Downgraded the org to free.
+3. Executed lean mode: `JOB_SEARCH_HYBRID_ENABLED=false` + redeployed api
+(embed-on-write off, search → lexical) → dropped the HNSW index +
+hybrid RPC + `embedding` column (migration
+`cached_jobs_lean_mode_drop_semantic_layer`) → `VACUUM FULL`.
+4. Result: **470 → 171 MB** (cached_jobs 451 → 151 MB), all 14,083 jobs
+intact, lexical `search_cached_jobs_ranked` serving live results.
+**330 MB under the cap.**
+
+**Why lean mode holds where VACUUM didn't:** with embed-on-write off and
+no HNSW index, the refresh now only touches cheap text columns — there's
+no per-row 1536-dim vector churn and no HNSW graph to bloat. The DB sits
+at ~150-200 MB and *stays* there. That is the whole point of lean vs a
+VACUUM (which the churn undoes in days).
+
+Also corrected: the prior version of this runbook claimed the churn
+re-accumulates "~30-60 MB/month" and that a periodic VACUUM is a lever.
+**Both were wrong** — re-bloat is ~90 MB/*day* back to the ~485 MB
+high-water-mark, so VACUUM buys ~2 days, not months. Lean mode (or Pro)
+is the only durable answer.
 
 **Downgrade-at-launch caveat:** free tier (a) **auto-pauses a project
 after 7 days of inactivity** — a launched-but-quiet app sleeps and the
