|
# Depot crash course

How the Depot SQLite backend reads, writes, compacts, and fences branchable database storage. Read this before changing anything in `engine/packages/depot/`.

For VFS-side parity rules, see [sqlite-vfs.md](sqlite-vfs.md). For exact key formats, see [sqlite/storage-structure.md](sqlite/storage-structure.md).

## Storage model

Depot stores SQLite pages in UDB first. S3 is optional cold storage for workflow-published shard objects.

| Row family | Holds | Owner |
|---|---|---|
| `DBPTR` / `BUCKET_PTR` | Current database and bucket branch pointers | Conveyer branch APIs |
| `BUCKET_CATALOG` | Database membership facts in bucket branches | Conveyer branch APIs |
| `BRANCHES` / `BUCKET_BRANCH` | Branch records, refcounts, pin floors, lifecycle generations | Conveyer, GC, workflow checks |
| `BR/{branch}/META/head` | Current database head | Commit path |
| `BR/{branch}/COMMITS` and `BR/{branch}/VTX` | Commit metadata and versionstamp-to-txid lookup | Commit path |
| `BR/{branch}/PIDX` and `BR/{branch}/DELTA` | Recent page-owner index and LTX delta chunks | Commit path |
| `BR/{branch}/SHARD` | Reader-visible hot shard versions and cold-backed shard-cache rows | Workflow manager, cache fill, reclaimer |
| `BR/{branch}/CMP/*` | Workflow manifest, cold refs, retired cold objects, staged hot output | Workflow manager and companions |
| `BR/{branch}/PITR_INTERVAL` | Automatic PITR interval coverage rows | Workflow hot install and reclaim |
| `RESTORE_POINT` and `DB_PIN` | User-retained restore points and exact history pins | Restore point APIs and workflow proof |

The main invariant is simple: **commits write deltas directly to UDB; workflow compaction is the only publish/delete authority for compaction output.**
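
The row families above are ordered byte keys, so each family supports cheap range scans. As a minimal sketch of why the encoding matters — assuming illustrative key shapes and a hypothetical `pidx_key` helper, not the real format (see [sqlite/storage-structure.md](sqlite/storage-structure.md) for that):

```rust
// Hypothetical sketch: branch-scoped rows as ordered byte keys.
// Key shapes here are illustrative, not the real on-disk format.
fn pidx_key(branch_id: u64, pgno: u32) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(b"BR/");
    key.extend_from_slice(&branch_id.to_be_bytes()); // big-endian keeps branches contiguous
    key.extend_from_slice(b"/PIDX/");
    key.extend_from_slice(&pgno.to_be_bytes()); // big-endian keeps pages in numeric order
    key
}

fn main() {
    // Lexicographic byte order matches numeric page order, so a range scan
    // over a branch's PIDX prefix visits pages in ascending pgno.
    assert!(pidx_key(7, 1) < pidx_key(7, 2));
    assert!(pidx_key(7, 255) < pidx_key(7, 256)); // holds across byte boundaries
    println!("ok");
}
```

Big-endian integer encoding is what makes lexicographic key order agree with numeric order; little-endian keys would interleave pages from different numeric ranges.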

## Read path

Reads resolve the database pointer to a database branch, build a branch-aware read plan, and fetch each page through the hot path first:

```text
1. Read branch head or fork head metadata.
2. Return missing for pages above EOF.
3. Check PIDX and DELTA first.
4. If the DELTA is absent or reclaimed, fall back to the newest SHARD at or below the read cap.
5. If no FDB SHARD covers the page, locate a matching workflow CMP/cold_shard ref and read the cold object.
6. If the cold read succeeds, enqueue a bounded background fill to restore the matching FDB SHARD cache row.
```
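
The fallback order above can be sketched as a pure decision function. This is a simplified model, assuming hypothetical names (`PageSource`, `resolve_page`) and flattening the real plan's inputs into booleans; the actual logic lives in `engine/packages/depot/`:

```rust
// Hypothetical sketch of the per-page fallback chain; names are illustrative.
#[derive(Debug, PartialEq)]
enum PageSource {
    Missing,       // above EOF (real code distinguishes "no coverage" as an error)
    Delta(u64),    // recent commit still present in DELTA
    HotShard(u64), // newest reader-visible SHARD at or below the read cap
    ColdObject,    // workflow-published object located via a CMP/cold_shard ref
}

fn resolve_page(
    pgno: u32,
    eof_pages: u32,
    delta_owner: Option<u64>,    // PIDX hit, if any
    delta_present: bool,         // the DELTA row may have been reclaimed
    hot_shard_txid: Option<u64>, // newest FDB SHARD covering this page
    cold_ref_exists: bool,
) -> PageSource {
    if pgno >= eof_pages {
        return PageSource::Missing;
    }
    if let Some(txid) = delta_owner {
        if delta_present {
            return PageSource::Delta(txid);
        }
        // Stale PIDX: the owner was reclaimed, fall through to shards.
    }
    if let Some(txid) = hot_shard_txid {
        return PageSource::HotShard(txid);
    }
    if cold_ref_exists {
        return PageSource::ColdObject;
    }
    PageSource::Missing
}

fn main() {
    assert_eq!(resolve_page(10, 100, Some(42), true, Some(7), true), PageSource::Delta(42));
    // Stale PIDX entry: DELTA 42 was reclaimed, so the hot shard serves the page.
    assert_eq!(resolve_page(10, 100, Some(42), false, Some(7), true), PageSource::HotShard(7));
    // No hot coverage at all: only the cold object can satisfy the read.
    assert_eq!(resolve_page(10, 100, None, false, None, true), PageSource::ColdObject);
    println!("ok");
}
```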

Cold storage is optional. If only cold coverage can satisfy a page and the `Db` has no configured cold tier, reads fail with `ShardCoverageMissing` instead of inventing zero-filled bytes.

The in-process PIDX, branch-id, ancestry, and shard-cache fill queues are perf caches only. Correctness comes from UDB rows and workflow revalidation.

## Write path

SQLite commits call Depot through the conveyer path:

```text
1. Resolve DBPTR and read the current branch head in the UDB transaction.
2. Encode dirty pages into LTX DELTA chunks.
3. Write COMMITS, VTX, DELTA, and PIDX rows.
4. Update META/head and quota counters.
5. After commit, update SQLITE_CMP_DIRTY and send a throttled DeltasAvailable wake when lag crosses thresholds.
```
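
Step 2 splits one encoded blob across several rows because the KV store limits individual value sizes. A minimal sketch, assuming an illustrative 10 KiB chunk size and key shape (both hypothetical, not the real format):

```rust
// Hypothetical sketch: split one encoded LTX blob into fixed-size DELTA
// chunk rows. Chunk size and key shape are illustrative.
const CHUNK: usize = 10 * 1024;

fn delta_chunks(txid: u64, ltx_blob: &[u8]) -> Vec<(String, Vec<u8>)> {
    ltx_blob
        .chunks(CHUNK)
        .enumerate()
        .map(|(idx, chunk)| (format!("DELTA/{txid}/{idx}"), chunk.to_vec()))
        .collect()
}

fn main() {
    let blob = vec![0u8; 25 * 1024]; // 25 KiB splits into 3 chunk rows
    let rows = delta_chunks(9, &blob);
    assert_eq!(rows.len(), 3);
    assert_eq!(rows[0].0, "DELTA/9/0");
    assert_eq!(rows[2].1.len(), 5 * 1024); // final partial chunk
    println!("ok");
}
```

Keying chunks by `{txid}/{idx}` keeps a whole commit's payload contiguous, so reading one commit back is a single range scan.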

The commit path does **not** publish SHARD rows, upload cold objects, or delete old history. It only records new committed history and wakes workflow compaction.

## Workflow compaction

Each active database branch has one DB manager workflow plus hot, cold, and reclaimer companions, all unique by database branch id.

The manager owns planning and durable publication:

- Hot jobs stage LTX shard blobs under `CMP/stage/{job_id}/hot_shard`; the manager validates the active job, copies output to reader-visible `SHARD`, advances `CMP/root`, writes selected `PITR_INTERVAL` rows, and compare-clears matching PIDX.
- Cold jobs upload deterministic objects at `db/{branch}/shard/{shard_id}/{txid}-{job_id}-{hash}.ltx`; the manager publishes `CMP/cold_shard` refs only after revalidating branch lifecycle, manifest generation, pins, proof state, and covered inputs.
- Reclaim jobs delete hot rows only after the manager proves replacement coverage. They also retire cold refs, wait out the grace window, mark deletes issued, delete exact S3 keys, and leave completed retired records so object keys are not republished.
- Shard-cache eviction is a reclaimer lane. It clears only FDB `SHARD` rows that have matching `CMP/cold_shard` refs and are not retained by restore points, forks, or unexpired PITR interval coverage.
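
The cold object naming shown above is deterministic on purpose: a retried upload re-derives the same key instead of leaking orphan objects. A minimal sketch with illustrative field values:

```rust
// Hypothetical sketch of the deterministic cold object key described above.
// Field values are illustrative; the path shape comes from the doc.
fn cold_object_key(branch: u64, shard_id: u32, txid: u64, job_id: u64, hash: &str) -> String {
    format!("db/{branch}/shard/{shard_id}/{txid}-{job_id}-{hash}.ltx")
}

fn main() {
    let a = cold_object_key(1, 4, 900, 77, "abc123");
    let b = cold_object_key(1, 4, 900, 77, "abc123");
    assert_eq!(a, b); // same inputs, same key: uploads are safe to retry
    assert_eq!(a, "db/1/shard/4/900-77-abc123.ltx");
    println!("ok");
}
```

The `{job_id}` and content `{hash}` components also keep distinct jobs from colliding on the same `{shard_id}/{txid}` pair, which is why retired-record bookkeeping can refuse to republish an exact key.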

`CMP/root` watermarks are scheduling summaries, not deletion proof by themselves. Deletes re-read the exact pins, PIDX dependencies, SHARD coverage, lifecycle generation, and manifest generation inside the delete transaction.
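
The delete-time proof can be thought of as a conjunction of per-condition re-checks inside the transaction. A sketch under assumed names (`DeleteProof`, `may_delete`, and all field names are hypothetical, chosen to mirror the conditions listed above):

```rust
// Hypothetical sketch: a reclaim delete proceeds only if every condition
// re-checked inside the delete transaction still holds.
struct DeleteProof {
    pin_floor_clear: bool,       // no restore point or fork pins this history
    no_pidx_dependency: bool,    // no live PIDX row still routes into it
    replacement_coverage: bool,  // newer SHARD or cold ref covers the pages
    lifecycle_gen_matches: bool, // branch was not torn down and recreated
    manifest_gen_matches: bool,  // the planning manifest was not superseded
}

fn may_delete(p: &DeleteProof) -> bool {
    p.pin_floor_clear
        && p.no_pidx_dependency
        && p.replacement_coverage
        && p.lifecycle_gen_matches
        && p.manifest_gen_matches
}

fn main() {
    let mut proof = DeleteProof {
        pin_floor_clear: true,
        no_pidx_dependency: true,
        replacement_coverage: true,
        lifecycle_gen_matches: true,
        manifest_gen_matches: true,
    };
    assert!(may_delete(&proof));
    proof.manifest_gen_matches = false; // plan superseded mid-flight
    assert!(!may_delete(&proof)); // a stale CMP/root watermark alone is not proof
    println!("ok");
}
```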

## PITR and restore

Automatic timestamp-restore coverage is stored as `PITR_INTERVAL` rows, selected during hot compaction from commit wall-clock timestamps and the effective bucket/database PITR policy. Expired interval rows are soft pins until reclaim compare-clears them.
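
Serving a timestamp restore then reduces to finding the interval row that covers the requested wall-clock time. A minimal sketch, assuming hypothetical names (`PitrInterval`, `covering_txid`), millisecond timestamps, and inclusive bounds — all illustrative:

```rust
// Hypothetical sketch: map a requested restore timestamp to the txid of the
// PITR interval covering it. Conventions here are illustrative.
struct PitrInterval {
    start_ms: u64, // first covered commit wall-clock time
    end_ms: u64,   // last covered commit wall-clock time
    txid: u64,     // snapshot txid serving this interval
}

fn covering_txid(intervals: &[PitrInterval], ts_ms: u64) -> Option<u64> {
    intervals
        .iter()
        .find(|iv| iv.start_ms <= ts_ms && ts_ms <= iv.end_ms)
        .map(|iv| iv.txid)
}

fn main() {
    let intervals = vec![
        PitrInterval { start_ms: 0, end_ms: 999, txid: 10 },
        PitrInterval { start_ms: 1000, end_ms: 1999, txid: 25 },
    ];
    assert_eq!(covering_txid(&intervals, 1500), Some(25));
    assert_eq!(covering_txid(&intervals, 5000), None); // past retention: no coverage
    println!("ok");
}
```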

Restore points are retained user tokens. Creating a restore point resolves a `SnapshotSelector` to exact branch, txid, versionstamp, and wall-clock metadata, then writes a `RestorePointRecord` and `DB_PIN(kind=RestorePoint)`. Deleting it removes that hard pin and recomputes branch pin floors.

Fork and restore use the same primitive: resolve a snapshot selector, derive a branch at that exact point, and let the caller decide whether to keep a fork or move the database pointer.

## Cross-references

- Key layout: [sqlite/storage-structure.md](sqlite/storage-structure.md)
- Component ownership: [sqlite/components.md](sqlite/components.md)
- VFS parity rules: [sqlite-vfs.md](sqlite-vfs.md)
- Storage metrics: [SQLITE_METRICS.md](SQLITE_METRICS.md)