You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/APPEND_ONLY_RAFT_DOVETAIL.md
+94-25Lines changed: 94 additions & 25 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -12,6 +12,24 @@ or any other model where storage mutates state in place.
12
12
This doc names the property explicitly + lists the operational
13
13
consequences so adopters can choose the right deployment shape.
14
14
15
+
> ### Scope: external architecture pattern, not a built-in lance-graph feature
16
+
>
17
+
> Post-merge correction (paralleling peer review on the companion doc
18
+
> PR #453): the deployment shape described below ("peer-Raft +
19
+
> Lance-local-per-node") is an EXTERNAL ARCHITECTURE PATTERN adopters
20
+
> can build on top of lance-graph, NOT a built-in lance-graph
21
+
> capability. Lance-graph provides the append-only columnar storage +
22
+
> the DataFusion query path + the encoding crates; the Raft layer +
23
+
> the substrate binary + the consensus-replication path are
24
+
> downstream consumer code (e.g. `openraft` or `surreal-cluster`).
25
+
> Adopters who only consume lance-graph's columnar + DataFusion path
26
+
> should NOT assume their data is automatically replicated.
27
+
>
28
+
> The doc documents WHY this pattern works well WHEN built on
29
+
> lance-graph — the storage-append/consensus-append dovetail property
30
+
> — not a feature lance-graph itself ships.
31
+
32
+
15
33
## The two write shapes that have to align
16
34
17
35
A distributed Lance deployment has two write paths:
@@ -35,15 +53,16 @@ Compare this with the conventional alternatives:
35
53
|---|---|---|
36
54
|**LSM-tree (Cassandra)**| Paxos-light / gossip | Storage AND consensus both have their own append-then-mutate cycles. Compaction in storage interacts with hinted handoff in consensus. Coordination headaches. |
37
55
|**B-tree (PostgreSQL)**| 2PC (citus-like) | Storage in-place updates fight with 2PC's append-log. Vacuum interacts with commit-log replay. More headaches. |
38
-
|**Append-only Lance**| Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems.|
56
+
|**Append-only Lance**| Append-only Raft | One write shape. Storage commit = consensus log entry. (`DatasetOptimizer.compact_files` produces a new manifest version — that output replicates through the same Raft log as a normal write. The operation runs in one place; the result replicates. Per codex P2 PR #454 — see Operational consequence #1 for the honest framing.)|
39
57
40
58
## Operational consequences
41
59
42
-
### 1. No compaction storms
60
+
### 1. Compaction is qualitatively different, not absent
43
61
44
-
Cassandra clusters periodically run compaction (rewrite SSTables to
45
-
reclaim space + maintain read performance). Each node compacts on
46
-
its own schedule. During compaction:
62
+
Cassandra clusters periodically run LSM compaction (rewrite SSTables
63
+
to reclaim tombstones + merge sorted runs to maintain read
64
+
performance). Each node compacts on its own schedule. During
65
+
compaction:
47
66
48
67
- The compacting node's CPU spikes
49
68
- Its disk write bandwidth spikes
@@ -56,15 +75,54 @@ The cluster operator's job is partly to schedule compactions across
56
75
nodes so that not too many compact simultaneously. This is a
57
76
significant operational burden.
58
77
59
-
Lance has no compaction. The version log IS the truth; old fragments
60
-
can be reclaimed by version-based GC (a much simpler operation than
61
-
SSTable compaction) but the GC is local-only, doesn't interact with
62
-
replication, and doesn't reorder anything.
78
+
Lance has compaction TOO, but of a qualitatively different shape:
79
+
`DatasetOptimizer.compact_files` merges small fragments into larger
80
+
ones to optimize query layout (many small appends produce many small
81
+
fragments which slow scans). For datasets that use deletes, updates,
82
+
or dropped columns, the SAME compaction also performs reclamation —
83
+
deletion vectors get materialized away (removing rows logically
84
+
marked for delete) and dropped columns are physically removed by
85
+
default. So there IS a reclamation role on those datasets; it is
86
+
just NOT the LSM tombstone-reclaim mechanism (Lance has no
87
+
tombstones at the version level — deletions are tracked by
88
+
deletion-vectors against append-only fragments). For append-only
89
+
write workloads with no deletes/updates/drops, the layout-only
90
+
framing applies; for mixed-write workloads, the reclamation
91
+
component is also present. Either way the operation produces new
92
+
append-only fragments at a better layout.
93
+
94
+
Operationally:
95
+
96
+
- The compaction OPERATION runs locally on whichever node has the
97
+
current leader role (or on a node permitted to run a maintenance
98
+
task in the chosen deployment shape); it does not block normal
99
+
write flow at the application layer
100
+
- The OUTPUT of compaction — the new manifest version + the new
101
+
set of fragments — flows through the Raft log like any other
102
+
Lance commit. Peers see the new manifest version after consensus
103
+
commits, and anti-entropy converges replicas to the post-compaction
104
+
state. So the result REPLICATES; the work that produced the result
105
+
is what runs in one place
106
+
- Per-node SCHEDULING choices do not stack into coordination
107
+
headaches the way Cassandra LSM scheduling does, because each
108
+
compaction's product is a single committed version (not a
109
+
per-replica concurrent rewrite that has to be reconciled). At most
110
+
one node should run a given compaction at a time to avoid wasted
111
+
work; this is a coordination choice (lock or leader-only), not a
112
+
coordination headache
113
+
- The failure modes are smaller: a partial compaction is recoverable
114
+
via Raft's standard log replay; no in-flight LSM tombstones to
115
+
lose; correctness is unaffected
63
116
64
117
A peer-Raft + Lance deployment therefore has uniform per-node
65
-
behavior. Each node is doing the same work at the same time, with
66
-
the same shape. The operations runbook is simpler because the
67
-
failure modes are simpler.
118
+
behavior under consensus. Compaction is a maintenance operation
119
+
that produces a normal commit; operators plan for it (it consumes
120
+
CPU + IO when it runs) but the cluster-wide coordination model is
121
+
simpler than Cassandra's per-node-independent LSM compaction
122
+
scheduling. (Per post-merge correction on PR #452; sharpened per
123
+
codex P2 review on PR #454 — the prior framing said 'independent
124
+
of consensus' which overclaimed; the operation is local but the
125
+
output replicates.)
68
126
69
127
### 2. Anti-entropy is a hash compare, not a Merkle-tree walk
70
128
@@ -109,19 +167,30 @@ footprint depends on which columns mutated, whether the row was new
109
167
or updated, whether the column had a previous value. Cross-DC
110
168
replication budget is harder to plan.
111
169
112
-
### 5. The consensus tax lands once, not twice
113
-
114
-
This is the unifying point: with non-append-only storage, an
115
-
application that wants linearizable writes pays the consensus tax
116
-
TWICE. Once for the consensus protocol shipping operations to
117
-
replicas. Once for the storage layer doing per-node compaction +
118
-
mutation bookkeeping. The two taxes interact (a compaction storm
119
-
delays consensus catch-up; a Raft snapshot has to materialize the
120
-
LSM-tree state).
121
-
122
-
With Lance + Raft, the consensus tax and the storage tax are the
123
-
SAME tax. The append IS both the consensus log entry and the storage
124
-
commit. You pay it once.
170
+
### 5. The consensus tax and the storage-COMMIT tax are the same tax
171
+
172
+
This is the unifying point: with LSM-tree storage, an application
173
+
that wants linearizable writes pays the consensus tax TWICE. Once
174
+
for the consensus protocol shipping operations to replicas. Once for
175
+
the storage layer doing per-node tombstone-reclaim + run-merge
176
+
compaction. The two taxes interact (a compaction storm delays
177
+
consensus catch-up; a Raft snapshot has to materialize the LSM-tree
178
+
state).
179
+
180
+
With Lance + Raft, the consensus tax and the storage-COMMIT tax are
181
+
the SAME tax — the append IS both the consensus log entry and the
182
+
storage commit; you pay it once. Lance does have its own file-
183
+
compaction cycle (`DatasetOptimizer.compact_files`), which produces
184
+
a NEW manifest version — and that new version flows through the
185
+
same Raft log as any other write, replicating to peers via the
186
+
normal consensus + anti-entropy path. So compaction's OUTPUT
187
+
counts as a consensus event (one more append). What it does NOT
188
+
add is a SECOND tax of the LSM-tree kind (per-node tombstone-
189
+
reclaim + run-merge bookkeeping that runs on every replica
190
+
independently and creates coordination headaches with replication).
191
+
The LAYOUT-OPTIMIZATION cycle exists; it pays the SAME consensus
192
+
tax as a regular write (one commit), and does NOT layer a separate
0 commit comments