Skip to content

Commit e55d70c

Browse files
Phase 7: secondary indexes — entry contract, DDL, atomic maintenance, base↔index validation; writer reclaim fix
1 parent 5db51bb commit e55d70c

17 files changed

Lines changed: 1398 additions & 93 deletions

File tree

CHANGELOG.md

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,47 @@ under a category (`Added` / `Changed` / `Fixed` / `Removed` / `Security`).
88

99
## [Unreleased]
1010

11+
### Phase 7 — Indexing
12+
13+
#### Added
14+
- `index`: the secondary-index entry contract — keys are the
15+
order-preserving encoding of the indexed columns (+ encoded-PK suffix for
16+
non-unique indexes); entry values are the encoded PK; unique indexes skip
17+
rows with NULL indexed columns, non-unique include them (`DECISIONS.md`
18+
D18) — plus probe/prefix-scan bounds and thin maintenance ops over
19+
`txn::WriteCtx`.
20+
- `catalog`: `IndexDef` (single-column, composite, unique) persisted in the
21+
table definition (codec v2; v1 records still decode); per-index
22+
`("iroot", table, index)` root entries committed atomically with the base
23+
root.
24+
- Index DDL: `create_index` (validates, **backfills from existing rows**,
25+
rejects unique violations found in the data) and `drop_index` (frees the
26+
index tree; the implicit backing of a `unique` column cannot be dropped).
27+
- Automatic maintenance in the same write transaction as every insert /
28+
update / delete — updates touch only indexes whose keys changed; multi-row
29+
atomicity covers index entries (in-batch unique collisions included).
30+
- `unique` columns now create **implicit unique indexes**
31+
(`uniq_<table>_<column>`) and are enforced by index probes — D16's
32+
provisional scan probe is deleted.
33+
- `CatSnapshot::validate()`: structural validation of every tree plus an
34+
entry-for-entry **brute-force cross-check of every index against its base
35+
table**, over the snapshot's pinned version.
36+
- Exit-criteria tests: seeded random-DML property test (insert/update/delete
37+
against a model; every index brute-force-checked along the way), unique
38+
violations at insert/update/backfill, atomicity, page reclamation for
39+
dropped indexes/tables, pinned-snapshot consistency.
40+
41+
#### Changed
42+
- `txn`: `Snapshot::validate_tree` exposes the B+tree structural validator
43+
for trees under the pinned root.
44+
45+
#### Fixed
46+
- `txn`: the writer reclaimed watermark-cleared pages at the start of
47+
**every** batch, including batches that end up committing nothing (all
48+
transactions rejected) — leaving the in-memory freelist ahead of the disk
49+
until the next commit, which `validate()` reported as freelist corruption.
50+
Reclamation now runs only inside committing batches (`DECISIONS.md` D19).
51+
1152
### Phase 6 — Schema, catalog & constraints
1253

1354
#### Added

Cargo.lock

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

DECISIONS.md

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,47 @@ Per `PLAN.md` §1 rule 6, every resolution of an ambiguity or deviation from
55

66
---
77

8+
## D19 — Reclamation only runs inside a committing batch
9+
10+
**Phase:** 7 · **Status:** accepted (bug fix of Phase 4 behavior)
11+
12+
The Phase 4 writer reclaimed watermark-cleared pages at the **start of every
13+
batch**. A batch whose transactions are all rejected never commits — but
14+
reclamation had already mutated the pager's in-memory freelist, leaving it
15+
ahead of the disk (surfaced by `validate()` as freelist corruption; on a
16+
crash in that window the parked pages would simply leak, as they always do
17+
when the writer's in-memory park list dies — no disk state was ever wrong).
18+
19+
**Decision:** the writer reclaims **after applying a batch's jobs and only
20+
when the batch will commit**, so the freelist changes ride the same fsync
21+
pair. Pages a batch frees become reusable one batch later (instead of within
22+
the same batch) — a negligible cost. Found by Phase 7's `drop index` tests;
23+
regression-tested at the txn layer.
24+
25+
## D18 — Index entry contract: PK as value, NULL rows skip unique indexes
26+
27+
**Phase:** 7 · **Status:** accepted
28+
29+
`ARCHITECTURE.md` §3.6 fixes the key shape (encoded indexed columns, PK
30+
suffix when non-unique) but leaves the entry value and NULL semantics open.
31+
32+
**Decision:**
33+
- **Entry value = the encoded PK** (the base-tree key) for both unique and
34+
non-unique indexes: uniform, and resolving an entry to its row is one base
35+
lookup without re-deriving key suffixes.
36+
- **Unique indexes skip rows with any NULL indexed column** — NULLs never
37+
conflict (D17), and since unique entries carry no PK suffix, storing NULL
38+
rows would falsely collide. Non-unique indexes include NULL rows (nulls
39+
sort first, queryable).
40+
- **`unique` columns are enforced by an implicit single-column unique
41+
index** named `uniq_<table>_<column>` (replacing D16's provisional scan
42+
probe). The implicit backing cannot be dropped; `drop table` frees index
43+
trees with the base.
44+
- Index names are per-table; the catalog stores one `("iroot", table, index)`
45+
root entry per index, updated in the same commit as the base root. The
46+
table-definition codec is now **version 2** (v1 records still decode,
47+
index-less).
48+
849
## D17 — Write-path semantics the SPEC leaves open
950

1051
**Phase:** 6 · **Status:** accepted

crates/catalog/Cargo.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,7 @@ common.workspace = true
1414
pager.workspace = true
1515
btree.workspace = true
1616
types.workspace = true
17+
index.workspace = true
1718
txn.workspace = true
1819
thiserror.workspace = true
1920

crates/catalog/src/codec.rs

Lines changed: 39 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -8,11 +8,14 @@
88
use types::{decode_row, encode_row, TypeKind, Value};
99

1010
use crate::schema::{
11-
CheckExpr, CmpOp, ColumnDef, DefaultSpec, TableDef, UpdatePolicy, MAX_CHECK_DEPTH,
11+
CheckExpr, CmpOp, ColumnDef, DefaultSpec, IndexDef, TableDef, UpdatePolicy, MAX_CHECK_DEPTH,
1212
};
1313
use crate::{CatalogCorruption, CatalogError, Result};
1414

15-
const VERSION: u8 = 1;
15+
/// Version 2 added the secondary-index section (Phase 7); version-1 records
16+
/// (no indexes) still decode.
17+
const VERSION: u8 = 2;
18+
const VERSION_NO_INDEXES: u8 = 1;
1619

1720
const FLAG_NULLABLE: u8 = 1 << 0;
1821
const FLAG_UNIQUE: u8 = 1 << 1;
@@ -79,14 +82,23 @@ pub(crate) fn encode_table(def: &TableDef) -> Result<Vec<u8>> {
7982
for check in &def.checks {
8083
push_check(&mut out, check)?;
8184
}
85+
push_count(&mut out, def.indexes.len())?;
86+
for index in &def.indexes {
87+
push_str(&mut out, &index.name)?;
88+
out.push(u8::from(index.unique));
89+
push_count(&mut out, index.columns.len())?;
90+
for col in &index.columns {
91+
push_str(&mut out, col)?;
92+
}
93+
}
8294
Ok(out)
8395
}
8496

8597
/// Decode a stored table definition.
8698
pub(crate) fn decode_table(bytes: &[u8]) -> Result<TableDef> {
8799
let mut r = Reader { rest: bytes };
88100
let version = r.byte()?;
89-
if version != VERSION {
101+
if version != VERSION && version != VERSION_NO_INDEXES {
90102
return Err(corrupt(CatalogCorruption::BadVersion { version }));
91103
}
92104
let name = r.string()?;
@@ -129,6 +141,29 @@ pub(crate) fn decode_table(bytes: &[u8]) -> Result<TableDef> {
129141
for _ in 0..check_count {
130142
checks.push(r.check(0)?);
131143
}
144+
let mut indexes = Vec::new();
145+
if version >= VERSION {
146+
let index_count = r.count()?;
147+
indexes.reserve(index_count.min(64));
148+
for _ in 0..index_count {
149+
let name = r.string()?;
150+
let unique = match r.byte()? {
151+
0 => false,
152+
1 => true,
153+
tag => return Err(corrupt(CatalogCorruption::BadTag { tag })),
154+
};
155+
let col_count = r.count()?;
156+
let mut cols = Vec::with_capacity(col_count.min(16));
157+
for _ in 0..col_count {
158+
cols.push(r.string()?);
159+
}
160+
indexes.push(IndexDef {
161+
name,
162+
columns: cols,
163+
unique,
164+
});
165+
}
166+
}
132167
if !r.rest.is_empty() {
133168
return Err(corrupt(CatalogCorruption::TrailingBytes));
134169
}
@@ -137,6 +172,7 @@ pub(crate) fn decode_table(bytes: &[u8]) -> Result<TableDef> {
137172
columns,
138173
pk,
139174
checks,
175+
indexes,
140176
})
141177
}
142178

crates/catalog/src/db.rs

Lines changed: 77 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ use types::{UuidV7Gen, Value};
99

1010
use crate::codec::decode_table;
1111
use crate::job::{decode_padded, CatalogJob, Env, JobOp, JobOut};
12-
use crate::schema::{ColumnDef, TableDef};
12+
use crate::schema::{ColumnDef, IndexDef, TableDef};
1313
use crate::store;
1414
use crate::{CatalogCorruption, CatalogError, Result};
1515

@@ -86,6 +86,26 @@ impl<B: IoBackend + 'static> Catalog<B> {
8686
.map(|_| ())
8787
}
8888

89+
/// `create index` — validate, **backfill from existing rows** (unique
90+
/// violations reject the DDL), and persist.
91+
pub fn create_index(&self, table: &str, index: IndexDef) -> Result<()> {
92+
self.submit(JobOp::CreateIndex {
93+
table: table.to_string(),
94+
index,
95+
})
96+
.map(|_| ())
97+
}
98+
99+
/// `drop index` — remove the definition and free the index's pages. The
100+
/// implicit index backing a `unique` column cannot be dropped.
101+
pub fn drop_index(&self, table: &str, index: &str) -> Result<()> {
102+
self.submit(JobOp::DropIndex {
103+
table: table.to_string(),
104+
index: index.to_string(),
105+
})
106+
.map(|_| ())
107+
}
108+
89109
/// Insert one row (named columns; omitted columns take defaults/generated
90110
/// values). Returns the full materialized row in schema column order.
91111
pub fn insert(&self, table: &str, row: Vec<(String, Value)>) -> Result<Vec<Value>> {
@@ -213,13 +233,69 @@ impl<B: IoBackend> CatSnapshot<B> {
213233
Ok(rows)
214234
}
215235

236+
/// Cross-check the whole database: every tree is structurally valid and
237+
/// **every index matches its base table entry-for-entry** (the Phase 7
238+
/// exit criterion). Read-only, over this snapshot's pinned version.
239+
pub fn validate(&self) -> Result<()> {
240+
if let Some(cat_root) = self.snap.root() {
241+
self.snap.validate_tree(cat_root)?;
242+
}
243+
for table in self.tables()? {
244+
let def = self.table(&table)?;
245+
let data_root = self.data_root(&table)?;
246+
self.snap.validate_tree(data_root)?;
247+
let base = self.snap.scan_in(data_root)?;
248+
249+
for idx in &def.indexes {
250+
let iroot = self.index_root(&table, &idx.name)?;
251+
self.snap.validate_tree(iroot)?;
252+
253+
// Brute-force expectation from the base rows.
254+
let mut expected: Vec<(Vec<u8>, Vec<u8>)> = Vec::new();
255+
for (pk_key, bytes) in &base {
256+
let row = decode_padded(&def, bytes)?;
257+
let cols: Vec<types::Value> = idx
258+
.columns
259+
.iter()
260+
.map(|name| {
261+
def.col_index(name)
262+
.and_then(|ci| row.get(ci).cloned())
263+
.unwrap_or(types::Value::Null)
264+
})
265+
.collect();
266+
if let Some(e) = index::entry(&cols, pk_key, idx.unique)? {
267+
expected.push((e.key, e.value));
268+
}
269+
}
270+
expected.sort();
271+
272+
let actual = self.snap.scan_in(iroot)?;
273+
if expected != actual {
274+
return Err(CatalogError::IndexOutOfSync {
275+
table: table.clone(),
276+
index: idx.name.clone(),
277+
});
278+
}
279+
}
280+
}
281+
Ok(())
282+
}
283+
216284
fn data_root(&self, table: &str) -> Result<pager::PageId> {
217285
let key = store::root_key(table)?;
218286
match self.snap.get(&key)? {
219287
Some(bytes) => store::decode_root(&bytes),
220288
None => Err(CatalogError::Corrupt(CatalogCorruption::MissingEntry)),
221289
}
222290
}
291+
292+
fn index_root(&self, table: &str, index: &str) -> Result<pager::PageId> {
293+
let key = store::iroot_key(table, index)?;
294+
match self.snap.get(&key)? {
295+
Some(bytes) => store::decode_root(&bytes),
296+
None => Err(CatalogError::Corrupt(CatalogCorruption::MissingEntry)),
297+
}
298+
}
223299
}
224300

225301
/// Map a transaction-layer error back to the catalog's typed error: a

0 commit comments

Comments
 (0)