Commit 19a428a

Phase 3: copy-on-write B+tree — lookup, insert/delete, range, snapshots
The ordered map over the pager. A mutation descends to the affected leaf and rebuilds the touched path into freshly allocated pages, returning the new root and the superseded page ids (Edit::freed) for the txn layer to reclaim; the tree never frees a page, so any earlier root stays a valid, immutable snapshot.

Variable-length slotted leaf/internal nodes (one per Data page) with a bounds-checked, never-panicking decoder; byte-based fill with split / merge / rotate and single-entry-too-large rejected as a typed error.

Range scans use a lazy forward/backward cursor over a root-to-leaf stack (no leaf sibling pointers, so edits stay O(log n) and old roots stay readable). validate() proves balanced depth, ordering consistent with separators, and non-empty non-root nodes.

Tests (PLAN §3 exit): seeded model-based property test vs BTreeMap across insert/delete/lookup/range with validate after every step and commit+reopen; a snapshot-isolation test (an old root sees no later writes); node-decoder robustness/fuzz.

DECISIONS D5 (node layout/keys), D6 (cursor, no leaf links), D7 (free-reporting contract); CHANGELOG updated.
1 parent abd8b73 commit 19a428a

10 files changed

Lines changed: 1653 additions & 4 deletions

CHANGELOG.md

Lines changed: 22 additions & 0 deletions
@@ -8,6 +8,28 @@ under a category (`Added` / `Changed` / `Fixed` / `Removed` / `Security`).

 ## [Unreleased]

+### Phase 3 — Copy-on-write B+tree
+
+#### Added
+- `btree`: the copy-on-write ordered map over the pager — point lookup,
+  insert/delete with node split/merge, and forward/backward range scans. A
+  mutation copies the touched path to a **new root** and returns the superseded
+  pages (`Edit::freed`) for the `txn` layer to reclaim; the tree never frees a
+  page, so an earlier root stays a valid snapshot (`DECISIONS.md` D7).
+- Variable-length slotted leaf/internal nodes, one per `Data` page, with a
+  bounds-checked decoder that rejects hostile bytes as typed `Corruption` errors
+  and never panics; byte-based fill with split / merge / rotate, single entries
+  capped at half a page (`EntryTooLarge` otherwise) (`DECISIONS.md` D5).
+- `Cursor`: a lazy, bounded, forward/backward range iterator that walks a
+  root-to-leaf stack — no leaf sibling pointers, so edits stay O(log n) and old
+  roots stay readable (`DECISIONS.md` D6).
+- `BTree::validate` proving balanced depth, ordering consistent with separators,
+  and non-empty non-root nodes.
+- Exit-criteria tests: a seeded model-based property test against
+  `std::collections::BTreeMap` (insert/delete/lookup/range, `validate` after
+  every step, commit + reopen), a snapshot-isolation test (an old root sees no
+  later writes), and node-decoder robustness/fuzz tests.
+
 ### Phase 2 — Pager (paged, checksummed, atomically-committing storage)

 #### Added
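The Phase 3 changelog entry describes a seeded model-based property test against `std::collections::BTreeMap`. The following is a minimal sketch of that testing style, not the test shipped in this commit: `Rng`, `SortedVecMap`, and `run_model_test` are hypothetical names, and a sorted vector stands in for the B+tree as the system under test.

```rust
use std::collections::BTreeMap;

/// Tiny deterministic PRNG (xorshift64), so a failing seed can be replayed.
pub struct Rng(u64);

impl Rng {
    pub fn new(seed: u64) -> Self {
        Rng(seed.max(1)) // xorshift must not start at zero
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical system under test: a sorted-vector map standing in for the tree.
#[derive(Default)]
pub struct SortedVecMap {
    entries: Vec<(u8, u8)>,
}

impl SortedVecMap {
    pub fn insert(&mut self, k: u8, v: u8) {
        match self.entries.binary_search_by_key(&k, |e| e.0) {
            Ok(i) => self.entries[i].1 = v,
            Err(i) => self.entries.insert(i, (k, v)),
        }
    }

    pub fn delete(&mut self, k: u8) {
        if let Ok(i) = self.entries.binary_search_by_key(&k, |e| e.0) {
            self.entries.remove(i);
        }
    }

    pub fn get(&self, k: u8) -> Option<u8> {
        let found = self.entries.binary_search_by_key(&k, |e| e.0).ok();
        found.map(|i| self.entries[i].1)
    }

    /// Entries with keys in the half-open range `[lo, hi)`.
    pub fn range(&self, lo: u8, hi: u8) -> Vec<(u8, u8)> {
        self.entries.iter().copied().filter(|e| lo <= e.0 && e.0 < hi).collect()
    }
}

/// Apply `steps` random operations to both maps and compare lookup and range
/// results after every step. Returns the first diverging step, if any.
pub fn run_model_test(seed: u64, steps: usize) -> Result<(), usize> {
    let mut rng = Rng::new(seed);
    let mut model = BTreeMap::new();
    let mut sut = SortedVecMap::default();
    for step in 0..steps {
        let r = rng.next_u64();
        let (op, k, v) = (r % 3, (r >> 8) as u8, (r >> 16) as u8);
        match op {
            0 => {
                model.insert(k, v);
                sut.insert(k, v);
            }
            1 => {
                model.remove(&k);
                sut.delete(k);
            }
            _ => {} // lookup-only step
        }
        let (lo, hi) = (k.min(v), k.max(v));
        let expected: Vec<(u8, u8)> = model.range(lo..hi).map(|(a, b)| (*a, *b)).collect();
        if sut.get(k) != model.get(&k).copied() || sut.range(lo, hi) != expected {
            return Err(step);
        }
    }
    Ok(())
}
```

Because the generator is seeded, a reported failing step can be reproduced exactly by rerunning with the same seed, which is the main attraction of this style over unseeded randomness.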

DECISIONS.md

Lines changed: 53 additions & 0 deletions
@@ -5,6 +5,59 @@ Per `PLAN.md` §1 rule 6, every resolution of an ambiguity or deviation from

 ---

+## D7 — B+tree mutations report superseded pages; the tree never frees
+
+**Phase:** 3 · **Status:** accepted
+
+`ARCHITECTURE.md` §3.2 says a modification "copies the touched path to a new
+root" and "the `txn` layer installs it on commit", and §3.3 ties page
+reclamation to live snapshots. That leaves open *who frees the old path*.
+
+**Decision:** the B+tree is a pure transformation over pager pages and **never
+frees a page itself**. `insert`/`delete` return an `Edit { new_root, freed }`,
+where `freed` lists the old copied-path (and merged-sibling) pages. The
+caller decides when to reclaim them — in Phase 4 the `txn` layer frees a page
+only once no live snapshot needs it; that is exactly what keeps an earlier root a
+valid, immutable snapshot. Phase-3 tests free eagerly when no snapshot is pinned
+(to bound file growth) and skip freeing for the snapshot-isolation test.
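The free-reporting contract in D7 can be illustrated with a toy path-copying structure. This is a sketch under loud assumptions: it uses an unbalanced binary search tree over an append-only `Arena` (a hypothetical stand-in for the pager), not the slotted B+tree from this commit, and only the `Edit { new_root, freed }` shape is taken from the text above.

```rust
/// Page ids are indices into the arena in this sketch.
pub type PageId = usize;

#[derive(Clone)]
struct Node {
    key: u32,
    val: u32,
    left: Option<PageId>,
    right: Option<PageId>,
}

/// Hypothetical stand-in for the pager: pages are only ever appended.
#[derive(Default)]
pub struct Arena {
    nodes: Vec<Node>,
}

/// Result of a mutation: the new root plus the pages it superseded.
pub struct Edit {
    pub new_root: PageId,
    pub freed: Vec<PageId>,
}

impl Arena {
    fn alloc(&mut self, node: Node) -> PageId {
        self.nodes.push(node);
        self.nodes.len() - 1
    }

    /// Insert by copying the root-to-target path. Nothing is freed here;
    /// superseded pages are only reported to the caller.
    pub fn insert(&mut self, root: Option<PageId>, key: u32, val: u32) -> Edit {
        let mut freed = Vec::new();
        let new_root = self.insert_rec(root, key, val, &mut freed);
        Edit { new_root, freed }
    }

    fn insert_rec(
        &mut self,
        at: Option<PageId>,
        key: u32,
        val: u32,
        freed: &mut Vec<PageId>,
    ) -> PageId {
        let Some(id) = at else {
            return self.alloc(Node { key, val, left: None, right: None });
        };
        // Copy the old page instead of editing it; the old one stays readable.
        let mut copy = self.nodes[id].clone();
        freed.push(id);
        if key < copy.key {
            copy.left = Some(self.insert_rec(copy.left, key, val, freed));
        } else if key > copy.key {
            copy.right = Some(self.insert_rec(copy.right, key, val, freed));
        } else {
            copy.val = val;
        }
        self.alloc(copy)
    }

    /// Point lookup as of a given root (a snapshot).
    pub fn get(&self, root: Option<PageId>, key: u32) -> Option<u32> {
        let mut at = root;
        while let Some(id) = at {
            let n = &self.nodes[id];
            if key == n.key {
                return Some(n.val);
            }
            at = if key < n.key { n.left } else { n.right };
        }
        None
    }
}
```

As long as the caller holds back the ids in `freed`, a lookup through the old root keeps returning the old contents, which is the snapshot-isolation property the decision relies on.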
+
+## D6 — No leaf sibling pointers; range scans use a root-to-leaf cursor stack
+
+**Phase:** 3 · **Status:** accepted
+
+A classic B+tree links leaves for fast range scans. Under copy-on-write that is
+costly: editing a leaf would force copying its linked neighbours (to update their
+pointers), turning an O(log n) path copy into O(n) fan-out.
+
+**Decision:** store **no sibling pointers**. A `Cursor` holds the descent path
+(a stack of node + index) and advances by walking the stack — O(log
+n) to cross a leaf boundary, both forward and backward. The cursor reads a fixed
+root, so it is a stable snapshot for its whole life. This also makes nodes purely
+parent-referenced, which is what lets an old root stay valid (see D7).
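The stack walk in D6 can be sketched over an in-memory tree. This is an illustration only: the `Node`, `Frame`, and `Cursor` types below are simplified, hypothetical stand-ins (borrowed nodes, `u32` keys, forward direction, no bounds), not the page-backed types in `cursor.rs`.

```rust
/// Hypothetical in-memory node; note there are no sibling links.
pub enum Node {
    Internal { children: Vec<Node> },
    Leaf { keys: Vec<u32> },
}

/// One level of the descent path: a node and a position within it.
struct Frame<'t> {
    node: &'t Node,
    idx: usize,
}

/// Forward cursor that crosses leaf boundaries by walking the stack.
pub struct Cursor<'t> {
    path: Vec<Frame<'t>>,
}

impl<'t> Cursor<'t> {
    pub fn new(root: &'t Node) -> Self {
        let mut cursor = Cursor { path: Vec::new() };
        cursor.descend_leftmost(root);
        cursor
    }

    /// Push frames from `node` down to its leftmost leaf.
    fn descend_leftmost(&mut self, mut node: &'t Node) {
        loop {
            self.path.push(Frame { node, idx: 0 });
            match node {
                Node::Internal { children } => match children.first() {
                    Some(child) => node = child,
                    None => return,
                },
                Node::Leaf { .. } => return,
            }
        }
    }
}

impl<'t> Iterator for Cursor<'t> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        loop {
            let top = self.path.last_mut()?;
            let node: &'t Node = top.node; // copy the shared reference out
            match node {
                Node::Leaf { keys } => {
                    if let Some(&k) = keys.get(top.idx) {
                        top.idx += 1;
                        return Some(k);
                    }
                    // Leaf exhausted: pop it and let the parent advance.
                    self.path.pop();
                }
                Node::Internal { children } => {
                    top.idx += 1;
                    match children.get(top.idx) {
                        Some(child) => self.descend_leftmost(child),
                        None => {
                            self.path.pop();
                        }
                    }
                }
            }
        }
    }
}
```

Crossing a leaf boundary pops at most one frame per level and pushes at most one per level, which is where the O(log n) bound quoted above comes from.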
+
+## D5 — Variable-length slotted nodes, byte-fill split/merge, provisional raw keys
+
+**Phase:** 3 · **Status:** accepted
+
+`ARCHITECTURE.md` §3.2 mandates an order-preserving key encoding (delivered by
+`types` in Phase 5) and node split/merge, but not a concrete node layout.
+
+**Decision:**
+- **Node layout:** one node per `Data` page; a kind byte distinguishes leaf vs
+  internal in the payload. Keys/values are variable length, so fill is measured
+  in **bytes**: a node splits when an entry won't fit and is rebalanced (merge,
+  or merge-then-split) when it drops below ¼-page. Each cell is capped at half a
+  page (`MAX_CELL`) so any two cells share a page — guaranteeing a split always
+  yields two non-empty halves and an internal node always holds ≥2 children. A
+  single entry over the cap is a typed `EntryTooLarge` error; v1 has no overflow
+  pages (deferred).
+- **Decode-whole/encode-whole:** because CoW rewrites a whole node on every edit,
+  nodes are decoded to an in-memory form and re-encoded rather than edited
+  in-place — simpler and the cost is dwarfed by the page write.
+- **Provisional keys:** keys are compared **bytewise** (raw `&[u8]`). PLAN §3
+  calls for this provisional scheme; the Phase-5 order-preserving encoding will
+  produce byte strings that compare identically, so the tree is unaffected.
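The byte-based fill rules in D5 can be sketched as a few small checks. The numbers here are assumptions, not values from this commit: `PAYLOAD` (usable bytes per node) and the 4-byte per-cell overhead in `cell_size` are made up for illustration, `MIN_FILL` is a hypothetical name for the ¼-page threshold, and `split_point` is one plausible way to choose a byte-balanced split.

```rust
/// Assumed usable bytes per node; the real value depends on the pager.
pub const PAYLOAD: usize = 4096;
/// Cap from D5: a single cell may use at most half a page.
pub const MAX_CELL: usize = PAYLOAD / 2;
/// Hypothetical name for the quarter-page rebalance threshold.
pub const MIN_FILL: usize = PAYLOAD / 4;

#[derive(Debug, PartialEq, Eq)]
pub enum FillError {
    EntryTooLarge { size: usize, max: usize },
}

/// Bytes a key/value cell occupies. The 4-byte overhead is an assumption.
pub fn cell_size(key: &[u8], val: &[u8]) -> usize {
    4 + key.len() + val.len()
}

/// Reject an entry whose cell would exceed the half-page cap.
pub fn check_entry(key: &[u8], val: &[u8]) -> Result<usize, FillError> {
    let size = cell_size(key, val);
    if size > MAX_CELL {
        Err(FillError::EntryTooLarge { size, max: MAX_CELL })
    } else {
        Ok(size)
    }
}

/// A node below a quarter page is a candidate for merge or merge-then-split.
pub fn needs_rebalance(used_bytes: usize) -> bool {
    used_bytes < MIN_FILL
}

/// Pick a split index so the left half holds about half the bytes while both
/// halves stay non-empty. Returns `None` when there are fewer than two cells.
pub fn split_point(sizes: &[usize]) -> Option<usize> {
    if sizes.len() < 2 {
        return None;
    }
    let total: usize = sizes.iter().sum();
    let mut acc = 0;
    for (i, size) in sizes.iter().enumerate() {
        acc += size;
        if acc * 2 >= total {
            // Clamp so neither side of the split is empty.
            return Some((i + 1).clamp(1, sizes.len() - 1));
        }
    }
    Some(sizes.len() - 1)
}
```

With every cell capped at half the payload, any node that overflows holds at least two cells, so a split index in `1..len` always exists.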
+
 ## D4 — Seeded, model-based property tests instead of `proptest`

 **Phase:** 2 · **Status:** accepted

crates/btree/Cargo.toml

Lines changed: 4 additions & 0 deletions
@@ -14,5 +14,9 @@ common.workspace = true
 pager.workspace = true
 thiserror.workspace = true

+[dev-dependencies]
+common.workspace = true
+pager.workspace = true
+
 [lints]
 workspace = true

crates/btree/src/cursor.rs

Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
+//! A lazy range cursor over a B+tree snapshot.
+//!
+//! CoW trees keep no leaf sibling pointers (they would have to be copied on
+//! every edit), so iteration walks a root-to-leaf path held on a stack. The
+//! cursor reads the tree at a fixed `root`, giving a stable view for its whole
+//! life regardless of concurrent writers installing new roots.
+
+use common::IoBackend;
+use pager::{PageId, Pager, HEADER_SIZE};
+
+use crate::node::Node;
+use crate::Result;
+
+/// Iteration direction.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum Direction {
+    /// Ascending key order.
+    Forward,
+    /// Descending key order.
+    Backward,
+}
+
+/// A key/value pair yielded by a [`Cursor`].
+pub type Item = (Vec<u8>, Vec<u8>);
+
+/// One level of the descent path.
+struct Frame {
+    node: Node,
+    /// Internal: the child index this frame descends through. Leaf: the next
+    /// entry index a forward step would yield (a backward step yields `idx-1`).
+    idx: usize,
+}
+
+/// A lazy, bounded iterator over `[lo, hi)` of a tree.
+///
+/// Advance it with [`next_entry`](Cursor::next_entry), which returns `Ok(None)`
+/// at the end. Both bounds are optional; `lo` is inclusive and `hi` is exclusive.
+pub struct Cursor<'p, B: IoBackend> {
+    pager: &'p Pager<B>,
+    dir: Direction,
+    lo: Option<Vec<u8>>,
+    hi: Option<Vec<u8>>,
+    path: Vec<Frame>,
+}
+
+impl<'p, B: IoBackend> Cursor<'p, B> {
+    pub(crate) fn open(
+        pager: &'p Pager<B>,
+        root: PageId,
+        dir: Direction,
+        lo: Option<&[u8]>,
+        hi: Option<&[u8]>,
+    ) -> Result<Cursor<'p, B>> {
+        let mut cursor = Cursor {
+            pager,
+            dir,
+            lo: lo.map(<[u8]>::to_vec),
+            hi: hi.map(<[u8]>::to_vec),
+            path: Vec::new(),
+        };
+        // Both directions seek to "the first entry ≥ the start bound": for
+        // forward that bound is `lo` (and we step forward from it); for backward
+        // it is `hi` (and we step backward into the range from it). An absent
+        // bound is −∞ for forward (the start) but +∞ for backward (the end).
+        let bound = match dir {
+            Direction::Forward => cursor.lo.clone(),
+            Direction::Backward => cursor.hi.clone(),
+        };
+        match (dir, bound) {
+            (Direction::Backward, None) => cursor.descend_rightmost(root)?,
+            (_, b) => cursor.descend_lower_bound(root, b.as_deref())?,
+        }
+        Ok(cursor)
+    }
+
+    /// Yield the next pair, or `Ok(None)` when the range is exhausted. Fallible
+    /// (a page read may fail), so this is not an [`Iterator`].
+    pub fn next_entry(&mut self) -> Result<Option<Item>> {
+        match self.dir {
+            Direction::Forward => self.next_forward(),
+            Direction::Backward => self.next_backward(),
+        }
+    }
+
+    /// Collect the remaining pairs into a vector (convenience for callers/tests).
+    pub fn collect_all(mut self) -> Result<Vec<Item>> {
+        let mut out = Vec::new();
+        while let Some(item) = self.next_entry()? {
+            out.push(item);
+        }
+        Ok(out)
+    }
+
+    fn read(&self, id: PageId) -> Result<Node> {
+        let frame = self.pager.read_page(id)?;
+        Node::decode(&frame[HEADER_SIZE..], id)
+    }
+
+    /// Build a path positioned at the first entry whose key is ≥ `bound`
+    /// (or the leaf's end if none here — a caller step then crosses leaves).
+    fn descend_lower_bound(&mut self, root: PageId, bound: Option<&[u8]>) -> Result<()> {
+        let mut id = root;
+        loop {
+            let node = self.read(id)?;
+            match &node {
+                Node::Internal { keys, children } => {
+                    let ci = bound.map_or(0, |b| keys.partition_point(|k| k.as_slice() <= b));
+                    let next = children[ci];
+                    self.path.push(Frame { node, idx: ci });
+                    id = next;
+                }
+                Node::Leaf { keys, .. } => {
+                    let start = bound.map_or(0, |b| keys.partition_point(|k| k.as_slice() < b));
+                    self.path.push(Frame { node, idx: start });
+                    return Ok(());
+                }
+            }
+        }
+    }
+
+    /// Descend to the leftmost leaf of `id`, positioned at its first entry.
+    fn descend_leftmost(&mut self, id: PageId) -> Result<()> {
+        let mut id = id;
+        loop {
+            let node = self.read(id)?;
+            match &node {
+                Node::Internal { children, .. } => {
+                    let next = children[0];
+                    self.path.push(Frame { node, idx: 0 });
+                    id = next;
+                }
+                Node::Leaf { .. } => {
+                    self.path.push(Frame { node, idx: 0 });
+                    return Ok(());
+                }
+            }
+        }
+    }
+
+    /// Descend to the rightmost leaf of `id`, positioned just past its last
+    /// entry (so a backward step yields that last entry).
+    fn descend_rightmost(&mut self, id: PageId) -> Result<()> {
+        let mut id = id;
+        loop {
+            let node = self.read(id)?;
+            match &node {
+                Node::Internal { children, .. } => {
+                    let last = children.len() - 1;
+                    let next = children[last];
+                    self.path.push(Frame { node, idx: last });
+                    id = next;
+                }
+                Node::Leaf { keys, .. } => {
+                    let end = keys.len();
+                    self.path.push(Frame { node, idx: end });
+                    return Ok(());
+                }
+            }
+        }
+    }
+
+    fn next_forward(&mut self) -> Result<Option<Item>> {
+        loop {
+            let Some(frame) = self.path.last() else {
+                return Ok(None);
+            };
+            if let Node::Leaf { keys, vals } = &frame.node {
+                let idx = frame.idx;
+                if idx < keys.len() {
+                    if self
+                        .hi
+                        .as_deref()
+                        .is_some_and(|hi| keys[idx].as_slice() >= hi)
+                    {
+                        self.path.clear();
+                        return Ok(None);
+                    }
+                    let item = (keys[idx].clone(), vals[idx].clone());
+                    if let Some(top) = self.path.last_mut() {
+                        top.idx += 1;
+                    }
+                    return Ok(Some(item));
+                }
+            }
+            // Leaf exhausted: drop it and advance the parent to the next child.
+            self.path.pop();
+            if !self.advance_to_next_subtree()? {
+                return Ok(None);
+            }
+        }
+    }
+
+    fn next_backward(&mut self) -> Result<Option<Item>> {
+        loop {
+            let Some(frame) = self.path.last() else {
+                return Ok(None);
+            };
+            if let Node::Leaf { keys, vals } = &frame.node {
+                if frame.idx > 0 {
+                    let idx = frame.idx - 1;
+                    if self
+                        .lo
+                        .as_deref()
+                        .is_some_and(|lo| keys[idx].as_slice() < lo)
+                    {
+                        self.path.clear();
+                        return Ok(None);
+                    }
+                    let item = (keys[idx].clone(), vals[idx].clone());
+                    if let Some(top) = self.path.last_mut() {
+                        top.idx -= 1;
+                    }
+                    return Ok(Some(item));
+                }
+            }
+            // Start of this leaf: drop it and retreat to the previous child.
+            self.path.pop();
+            if !self.retreat_to_prev_subtree()? {
+                return Ok(None);
+            }
+        }
+    }
+
+    /// After popping an exhausted leaf, move the path to the leftmost leaf of the
+    /// next sibling subtree. Returns `false` if there is none.
+    fn advance_to_next_subtree(&mut self) -> Result<bool> {
+        while let Some(frame) = self.path.last_mut() {
+            if let Node::Internal { children, .. } = &frame.node {
+                frame.idx += 1;
+                if frame.idx < children.len() {
+                    let child = children[frame.idx];
+                    self.descend_leftmost(child)?;
+                    return Ok(true);
+                }
+            }
+            self.path.pop();
+        }
+        Ok(false)
+    }
+
+    /// After popping a leaf at its start, move the path to the rightmost leaf of
+    /// the previous sibling subtree. Returns `false` if there is none.
+    fn retreat_to_prev_subtree(&mut self) -> Result<bool> {
+        while let Some(frame) = self.path.last_mut() {
+            if let Node::Internal { .. } = &frame.node {
+                if frame.idx > 0 {
+                    frame.idx -= 1;
+                    let child = match &frame.node {
+                        Node::Internal { children, .. } => children[frame.idx],
+                        Node::Leaf { .. } => return Ok(false),
+                    };
+                    self.descend_rightmost(child)?;
+                    return Ok(true);
+                }
+            }
+            self.path.pop();
+        }
+        Ok(false)
+    }
+}
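`descend_lower_bound` uses two different `partition_point` predicates: `<=` against internal separators and `<` against leaf keys. The sketch below isolates that distinction with plain byte-slice keys. The helper names `child_index` and `leaf_lower_bound` are hypothetical. The separator convention (keys equal to separator `i` live in child `i + 1`) is inferred from the `<=` predicate in the diff rather than stated in it.

```rust
/// Child to follow in an internal node: the number of separators <= bound.
/// A key equal to a separator therefore routes to the child on its right.
pub fn child_index(separators: &[&[u8]], bound: &[u8]) -> usize {
    separators.partition_point(|k| *k <= bound)
}

/// Position in a leaf of the first key >= bound (keys.len() if none).
/// A forward scan starts yielding at this index; a backward scan seeded
/// with `hi` yields index - 1 first, which keeps `hi` exclusive.
pub fn leaf_lower_bound(keys: &[&[u8]], bound: &[u8]) -> usize {
    keys.partition_point(|k| *k < bound)
}
```

Both helpers rely on the slices being sorted bytewise, which is what `slice::partition_point` requires for a meaningful result.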
