
Commit 4bd38f4

fix(codex): compaction-not-local-only + compaction-DOES-reclaim + versions-is-not-a-change-index
Three P2 findings on PR #454 walked back, all real overclaims:

P2 1 (APPEND_ONLY_RAFT_DOVETAIL.md): Walked back 'compaction runs INDEPENDENTLY of consensus replication'. Under peer-Raft, compaction COMMITS a new manifest version which IS part of the consensus log; the compaction OPERATION runs locally but its OUTPUT (new fragments + new manifest delta) flows through Raft like any other write. So peers see it. The independence claim should have been narrower: the SCHEDULING of the operation is local; the OUTCOME replicates.

P2 2 (APPEND_ONLY_RAFT_DOVETAIL.md): Walked back 'no tombstones to reclaim in the LSM sense; layout optimization only'. Lance compact_files DOES reclaim deleted rows and dropped columns by default. Different mechanism from LSM tombstones (Lance has no tombstones at the version level; rows are deleted via deletion vectors which compaction materializes away) but the functional role of reclamation IS present for datasets that use deletes, updates, or dropped columns. The doc now distinguishes 'no LSM-style tombstone reclaim' from 'has deletion-vector reclaim and dropped-column reclaim'.

P2 3 (CLUSTER_ASYMMETRY.md): Walked back the claim that Lance versions() is 'the time-axis index of which identity positions changed when'. versions() returns Vec<lance::dataset::Version> metadata: snapshot tags + timestamps, not a change-set index. To find which identities changed between versions, adopters compare snapshots OR maintain a separate index. The corrected bullet describes versions() as the version-snapshot log (which it is) and notes that the change-set derivation is consumer code.

Provenance: Codex P2 review on PR #454 commit 16f879b.
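The "operation is local, outcome replicates" distinction from P2 1 and the deletion-vector reclaim from P2 2 can be illustrated with a toy model. This is a minimal sketch in plain Python, not Lance or Raft code: `Fragment`, `Manifest`, `Replica`, `ToyLog`, and `compact` are hypothetical names invented for illustration, and the "log" is just an ordered list that every replica applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    rows: tuple                       # row ids stored in this fragment
    deleted: frozenset = frozenset()  # toy deletion vector: row ids marked deleted


@dataclass(frozen=True)
class Manifest:
    version: int
    fragments: tuple


def live_rows(manifest):
    """Row ids a reader of this manifest version would see."""
    return [r for f in manifest.fragments for r in f.rows if r not in f.deleted]


def compact(manifest):
    """The compaction OPERATION: runs on one node, touches no peer.

    Merges all fragments into one and drops rows marked in deletion
    vectors, returning a NEW manifest version (nothing is mutated).
    """
    merged = Fragment(rows=tuple(live_rows(manifest)))
    return Manifest(version=manifest.version + 1, fragments=(merged,))


class Replica:
    def __init__(self):
        self.manifest = Manifest(version=0, fragments=())

    def apply(self, manifest):
        # Committed entries are applied strictly in version order.
        if manifest.version != self.manifest.version + 1:
            raise ValueError("out-of-order manifest version")
        self.manifest = manifest


class ToyLog:
    """Hypothetical stand-in for the consensus log (no elections, no quorum)."""

    def __init__(self, replicas):
        self.entries = []
        self.replicas = replicas

    def commit(self, manifest):
        self.entries.append(manifest)
        for replica in self.replicas:
            replica.apply(manifest)


replicas = [Replica(), Replica(), Replica()]
log = ToyLog(replicas)

v1 = Manifest(1, (Fragment((1, 2)),))                # append rows 1, 2
log.commit(v1)
v2 = Manifest(2, v1.fragments + (Fragment((3,)),))   # append row 3
log.commit(v2)
v3 = Manifest(3, (Fragment((1, 2), frozenset({2})), Fragment((3,))))  # delete row 2
log.commit(v3)

v4 = compact(replicas[0].manifest)  # work happens on one node only
log.commit(v4)                      # the OUTPUT is a normal commit; peers see it
```

In this toy, every replica ends at the same post-compaction version with one merged fragment, and the logically deleted row is physically gone, while the set of visible rows is unchanged by compaction.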
1 parent d854974 commit 4bd38f4

2 files changed

Lines changed: 67 additions & 31 deletions


docs/APPEND_ONLY_RAFT_DOVETAIL.md

Lines changed: 52 additions & 24 deletions
@@ -53,7 +53,7 @@ Compare this with the conventional alternatives:
 |---|---|---|
 | **LSM-tree (Cassandra)** | Paxos-light / gossip | Storage AND consensus both have their own append-then-mutate cycles. Compaction in storage interacts with hinted handoff in consensus. Coordination headaches. |
 | **B-tree (PostgreSQL)** | 2PC (citus-like) | Storage in-place updates fight with 2PC's append-log. Vacuum interacts with commit-log replay. More headaches. |
-| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. (Lance does have its own file-layout compaction via `DatasetOptimizer.compact_files`, but it runs INDEPENDENTLY of consensus — see Operational consequence #1.) |
+| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. (`DatasetOptimizer.compact_files` produces a new manifest version — that output replicates through the same Raft log as a normal write. The operation runs in one place; the result replicates. Per codex P2 PR #454 — see Operational consequence #1 for the honest framing.) |
 
 ## Operational consequences
 
@@ -78,28 +78,51 @@ significant operational burden.
 Lance has compaction TOO, but of a qualitatively different shape:
 `DatasetOptimizer.compact_files` merges small fragments into larger
 ones to optimize query layout (many small appends produce many small
-fragments which slow scans). It is NOT a tombstone-reclaim cycle —
-Lance is append-only at the version level, so there are no tombstones
-to reclaim in the LSM sense. File compaction takes existing
-append-only fragments and produces new append-only fragments at a
-better layout.
+fragments which slow scans). For datasets that use deletes, updates,
+or dropped columns, the SAME compaction also performs reclamation —
+deletion vectors get materialized away (removing rows logically
+marked for delete) and dropped columns are physically removed by
+default. So there IS a reclamation role on those datasets; it is
+just NOT the LSM tombstone-reclaim mechanism (Lance has no
+tombstones at the version level — deletions are tracked by
+deletion-vectors against append-only fragments). For append-only
+write workloads with no deletes/updates/drops, the layout-only
+framing applies; for mixed-write workloads, the reclamation
+component is also present. Either way the operation produces new
+append-only fragments at a better layout.
 
 Operationally:
 
-- Compaction runs INDEPENDENTLY of consensus replication — file
-compaction does not block writes, does not affect Raft log shipping,
-does not coordinate across nodes
-- Per-node compaction is local-only; each node compacts its own
-Lance dataset on its own schedule without affecting peers
-- The failure modes are smaller (a partial compaction is recoverable;
-no in-flight tombstones to lose; correctness is unaffected)
+- The compaction OPERATION runs locally on whichever node has the
+current leader role (or on a node permitted to run a maintenance
+task in the chosen deployment shape); it does not block normal
+write flow at the application layer
+- The OUTPUT of compaction — the new manifest version + the new
+set of fragments — flows through the Raft log like any other
+Lance commit. Peers see the new manifest version after consensus
+commits, and anti-entropy converges replicas to the post-compaction
+state. So the result REPLICATES; the work that produced the result
+is what runs in one place
+- Per-node SCHEDULING choices do not stack into coordination
+headaches the way Cassandra LSM scheduling does, because each
+compaction's product is a single committed version (not a
+per-replica concurrent rewrite that has to be reconciled). At most
+one node should run a given compaction at a time to avoid wasted
+work; this is a coordination choice (lock or leader-only), not a
+coordination headache
+- The failure modes are smaller: a partial compaction is recoverable
+via Raft's standard log replay; no in-flight LSM tombstones to
+lose; correctness is unaffected
 
 A peer-Raft + Lance deployment therefore has uniform per-node
-behavior under both consensus and compaction. Operators still plan
-for file compaction (it consumes CPU + IO when it runs), but the
-cluster-wide coordination burden is significantly lower than
-Cassandra's LSM compaction scheduling. (Per post-merge correction
-on PR #452.)
+behavior under consensus. Compaction is a maintenance operation
+that produces a normal commit; operators plan for it (it consumes
+CPU + IO when it runs) but the cluster-wide coordination model is
+simpler than Cassandra's per-node-independent LSM compaction
+scheduling. (Per post-merge correction on PR #452; sharpened per
+codex P2 review on PR #454 — the prior framing said 'independent
+of consensus' which overclaimed; the operation is local but the
+output replicates.)
 
 ### 2. Anti-entropy is a hash compare, not a Merkle-tree walk
 
@@ -157,12 +180,17 @@ state).
 With Lance + Raft, the consensus tax and the storage-COMMIT tax are
 the SAME tax — the append IS both the consensus log entry and the
 storage commit; you pay it once. Lance does have its own file-
-compaction cycle (`DatasetOptimizer.compact_files`), but it runs
-INDEPENDENTLY of consensus — file compaction takes existing
-append-only fragments and rewrites them into bigger append-only
-fragments without tombstones to reclaim and without coordinating
-with replication. So the LAYOUT-OPTIMIZATION cycle exists but does
-not interact with the consensus tax.
+compaction cycle (`DatasetOptimizer.compact_files`), which produces
+a NEW manifest version — and that new version flows through the
+same Raft log as any other write, replicating to peers via the
+normal consensus + anti-entropy path. So compaction's OUTPUT
+counts as a consensus event (one more append). What it does NOT
+add is a SECOND tax of the LSM-tree kind (per-node tombstone-
+reclaim + run-merge bookkeeping that runs on every replica
+independently and creates coordination headaches with replication).
+The LAYOUT-OPTIMIZATION cycle exists; it pays the SAME consensus
+tax as a regular write (one commit), and does NOT layer a separate
+per-replica storage tax on top.
 
 ## What this implies for deployment shape
 
docs/CLUSTER_ASYMMETRY.md

Lines changed: 15 additions & 7 deletions
@@ -103,13 +103,21 @@ by 1-3 orders of magnitude vs LSM-tree wide-column representations:
 nibbles common across many entities can be stored once per path
 segment via consumer-side structures (O(k) lookup, prefix-sharing).
 **The dedup-by-prefix data structure itself is consumer code, not
-a built-in lance-graph crate.** Lance's own `versions()` log is the
-time-axis (cross-session index of which identity positions changed
-when). An earlier version of this doc cited `vort/vart` as if it
-were a shipped crate; corrected per peer review — the radix-shaped
-trie at the cognitive layer is a proposed pattern (no shipped crate
-name) and the identity primitive + the time-axis are the two
-shipped surfaces this bullet should have cited from the start.
+a built-in lance-graph crate.** Lance's `versions()` returns
+`Vec<lance::dataset::Version>` — the time-axis is the
+version-snapshot LOG (each version is a tagged snapshot of the
+dataset; the log is append-only, ordered, and queryable). It does
+NOT itself identify which identities changed in each snapshot;
+adopters who need a change-set derive it by comparing snapshots
+(or by maintaining a separate change-index per their workload's
+needs). Codex P2 review on PR #454 caught the prior overclaim
+that `versions()` was a 'changed-position index'; corrected:
+it's the snapshot LOG, and the change-set derivation is consumer
+code. An earlier version of this doc cited `vort/vart` as if it
+were a shipped crate (a separate fix). The two shipped surfaces
+for the bullet are `NiblePath` (identity) and `versions()`
+(snapshot log); the radix-shaped consumer trie and the
+change-set index are both consumer code.
 
 Concrete example: Wikidata (~115M entities). In Cassandra+JG, the
 indexed graph form is multi-TB with replication factor 3 → multi-TB
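
The corrected `versions()` bullet in this hunk leaves change-set derivation to consumer code. A minimal sketch of what "comparing snapshots" could look like follows, in plain Python rather than against the Lance API. The `snapshots` dict is a hypothetical stand-in for "the rows you get after checking out each version"; `change_set` and `change_index` are illustrative names, not part of Lance or lance-graph.

```python
def change_set(old, new):
    """Diff two snapshots, each a mapping of identity -> row content."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in shared if old[k] != new[k]),
    }


def change_index(snapshots):
    """Build a consumer-side change index over consecutive versions.

    `snapshots` maps version number -> snapshot mapping. The version
    log only supplies the ordered version numbers; the per-version
    contents have to be read and compared by the consumer.
    """
    versions = sorted(snapshots)
    return {
        (lo, hi): change_set(snapshots[lo], snapshots[hi])
        for lo, hi in zip(versions, versions[1:])
    }


# Hypothetical data: three versions of a tiny dataset keyed by identity path.
snapshots = {
    1: {"a/1": "x", "a/2": "y"},
    2: {"a/1": "x", "a/2": "z", "b/1": "w"},
    3: {"a/2": "z", "b/1": "w"},
}

index = change_index(snapshots)
```

Comparing full snapshots costs a scan per version pair, which is why a workload that asks "what changed when" often is better served by maintaining the separate change index at write time, as the bullet notes.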
