Skip to content

Commit 2d78b85

Browse files
authored
Merge pull request #454 from AdaWorldAPI/docs/post-merge-corrections-architecture
docs: post-merge corrections on PR #452 + #453 architecture docs
2 parents 16f879b + 4bd38f4 commit 2d78b85

2 files changed

Lines changed: 117 additions & 31 deletions

File tree

docs/APPEND_ONLY_RAFT_DOVETAIL.md

Lines changed: 94 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,24 @@ or any other model where storage mutates state in place.
1212
This doc names the property explicitly + lists the operational
1313
consequences so adopters can choose the right deployment shape.
1414

15+
> ### Scope: external architecture pattern, not a built-in lance-graph feature
16+
>
17+
> Post-merge correction (paralleling peer review on the companion doc
18+
> PR #453): the deployment shape described below ("peer-Raft +
19+
> Lance-local-per-node") is an EXTERNAL ARCHITECTURE PATTERN adopters
20+
> can build on top of lance-graph, NOT a built-in lance-graph
21+
> capability. Lance-graph provides the append-only columnar storage +
22+
> the DataFusion query path + the encoding crates; the Raft layer +
23+
> the substrate binary + the consensus-replication path are
24+
> downstream consumer code (e.g. `openraft` or `surreal-cluster`).
25+
> Adopters who only consume lance-graph's columnar + DataFusion path
26+
> should NOT assume their data is automatically replicated.
27+
>
28+
> The doc documents WHY this pattern works well WHEN built on
29+
> lance-graph — the storage-append/consensus-append dovetail property
30+
> — not a feature lance-graph itself ships.
31+
32+
1533
## The two write shapes that have to align
1634

1735
A distributed Lance deployment has two write paths:
@@ -35,15 +53,16 @@ Compare this with the conventional alternatives:
3553
|---|---|---|
3654
| **LSM-tree (Cassandra)** | Paxos-light / gossip | Storage AND consensus both have their own append-then-mutate cycles. Compaction in storage interacts with hinted handoff in consensus. Coordination headaches. |
3755
| **B-tree (PostgreSQL)** | 2PC (citus-like) | Storage in-place updates fight with 2PC's append-log. Vacuum interacts with commit-log replay. More headaches. |
38-
| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. |
56+
| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. (`DatasetOptimizer.compact_files` produces a new manifest version — that output replicates through the same Raft log as a normal write. The operation runs in one place; the result replicates. Per codex P2 PR #454 — see Operational consequence #1 for the honest framing.) |
3957

4058
## Operational consequences
4159

42-
### 1. No compaction storms
60+
### 1. Compaction is qualitatively different, not absent
4361

44-
Cassandra clusters periodically run compaction (rewrite SSTables to
45-
reclaim space + maintain read performance). Each node compacts on
46-
its own schedule. During compaction:
62+
Cassandra clusters periodically run LSM compaction (rewrite SSTables
63+
to reclaim tombstones + merge sorted runs to maintain read
64+
performance). Each node compacts on its own schedule. During
65+
compaction:
4766

4867
- The compacting node's CPU spikes
4968
- Its disk write bandwidth spikes
@@ -56,15 +75,54 @@ The cluster operator's job is partly to schedule compactions across
5675
nodes so that not too many compact simultaneously. This is a
5776
significant operational burden.
5877

59-
Lance has no compaction. The version log IS the truth; old fragments
60-
can be reclaimed by version-based GC (a much simpler operation than
61-
SSTable compaction) but the GC is local-only, doesn't interact with
62-
replication, and doesn't reorder anything.
78+
Lance has compaction TOO, but of a qualitatively different shape:
79+
`DatasetOptimizer.compact_files` merges small fragments into larger
80+
ones to optimize query layout (many small appends produce many small
81+
fragments which slow scans). For datasets that use deletes, updates,
82+
or dropped columns, the SAME compaction also performs reclamation —
83+
deletion vectors get materialized away (removing rows logically
84+
marked for delete) and dropped columns are physically removed by
85+
default. So there IS a reclamation role on those datasets; it is
86+
just NOT the LSM tombstone-reclaim mechanism (Lance has no
87+
tombstones at the version level — deletions are tracked by
88+
deletion-vectors against append-only fragments). For append-only
89+
write workloads with no deletes/updates/drops, the layout-only
90+
framing applies; for mixed-write workloads, the reclamation
91+
component is also present. Either way the operation produces new
92+
append-only fragments at a better layout.
93+
94+
Operationally:
95+
96+
- The compaction OPERATION runs locally on whichever node has the
97+
current leader role (or on a node permitted to run a maintenance
98+
task in the chosen deployment shape); it does not block normal
99+
write flow at the application layer
100+
- The OUTPUT of compaction — the new manifest version + the new
101+
set of fragments — flows through the Raft log like any other
102+
Lance commit. Peers see the new manifest version after consensus
103+
commits, and anti-entropy converges replicas to the post-compaction
104+
state. So the result REPLICATES; the work that produced the result
105+
is what runs in one place
106+
- Per-node SCHEDULING choices do not stack into coordination
107+
headaches the way Cassandra LSM scheduling does, because each
108+
compaction's product is a single committed version (not a
109+
per-replica concurrent rewrite that has to be reconciled). At most
110+
one node should run a given compaction at a time to avoid wasted
111+
work; this is a coordination choice (lock or leader-only), not a
112+
coordination headache
113+
- The failure modes are smaller: a partial compaction is recoverable
114+
via Raft's standard log replay; no in-flight LSM tombstones to
115+
lose; correctness is unaffected
63116

64117
A peer-Raft + Lance deployment therefore has uniform per-node
65-
behavior. Each node is doing the same work at the same time, with
66-
the same shape. The operations runbook is simpler because the
67-
failure modes are simpler.
118+
behavior under consensus. Compaction is a maintenance operation
119+
that produces a normal commit; operators plan for it (it consumes
120+
CPU + IO when it runs) but the cluster-wide coordination model is
121+
simpler than Cassandra's per-node-independent LSM compaction
122+
scheduling. (Per post-merge correction on PR #452; sharpened per
123+
codex P2 review on PR #454 — the prior framing said 'independent
124+
of consensus' which overclaimed; the operation is local but the
125+
output replicates.)
68126

69127
### 2. Anti-entropy is a hash compare, not a Merkle-tree walk
70128

@@ -109,19 +167,30 @@ footprint depends on which columns mutated, whether the row was new
109167
or updated, whether the column had a previous value. Cross-DC
110168
replication budget is harder to plan.
111169

112-
### 5. The consensus tax lands once, not twice
113-
114-
This is the unifying point: with non-append-only storage, an
115-
application that wants linearizable writes pays the consensus tax
116-
TWICE. Once for the consensus protocol shipping operations to
117-
replicas. Once for the storage layer doing per-node compaction +
118-
mutation bookkeeping. The two taxes interact (a compaction storm
119-
delays consensus catch-up; a Raft snapshot has to materialize the
120-
LSM-tree state).
121-
122-
With Lance + Raft, the consensus tax and the storage tax are the
123-
SAME tax. The append IS both the consensus log entry and the storage
124-
commit. You pay it once.
170+
### 5. The consensus tax and the storage-COMMIT tax are the same tax
171+
172+
This is the unifying point: with LSM-tree storage, an application
173+
that wants linearizable writes pays the consensus tax TWICE. Once
174+
for the consensus protocol shipping operations to replicas. Once for
175+
the storage layer doing per-node tombstone-reclaim + run-merge
176+
compaction. The two taxes interact (a compaction storm delays
177+
consensus catch-up; a Raft snapshot has to materialize the LSM-tree
178+
state).
179+
180+
With Lance + Raft, the consensus tax and the storage-COMMIT tax are
181+
the SAME tax — the append IS both the consensus log entry and the
182+
storage commit; you pay it once. Lance does have its own file-
183+
compaction cycle (`DatasetOptimizer.compact_files`), which produces
184+
a NEW manifest version — and that new version flows through the
185+
same Raft log as any other write, replicating to peers via the
186+
normal consensus + anti-entropy path. So compaction's OUTPUT
187+
counts as a consensus event (one more append). What it does NOT
188+
add is a SECOND tax of the LSM-tree kind (per-node tombstone-
189+
reclaim + run-merge bookkeeping that runs on every replica
190+
independently and creates coordination headaches with replication).
191+
The LAYOUT-OPTIMIZATION cycle exists; it pays the SAME consensus
192+
tax as a regular write (one commit), and does NOT layer a separate
193+
per-replica storage tax on top.
125194

126195
## What this implies for deployment shape
127196

docs/CLUSTER_ASYMMETRY.md

Lines changed: 23 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -95,12 +95,29 @@ by 1-3 orders of magnitude vs LSM-tree wide-column representations:
9595
reduces ANN scan space by ~16× under empirical intra-family
9696
locality (98.6% per the `lance-graph` PR #444 probe).
9797

98-
- **vort/vart adaptive radix trie**: structural deduplication of
99-
shared HHTL prefixes. The heel + hip nibbles common across many
100-
entities are stored ONCE per path segment, not N times. Adaptive
101-
Radix Tree shape: O(k) lookup, prefix-sharing storage. The same
102-
structure also serves as the time-axis index over Lance
103-
`versions()` for cold-path queries.
98+
- **`lance-graph-contract::hhtl::NiblePath` (shipped) + Lance
99+
`versions()` (shipped)**: HHTL identity is a 16ⁿ nibble path packed
100+
into a `u64` (`FAN_OUT = 16`, `MAX_DEPTH = 16`). Adopters who want
101+
to dedupe shared HHTL prefixes in memory typically derive an
102+
adaptive radix-trie index over `NiblePath` addresses — heel + hip
103+
nibbles common across many entities can be stored once per path
104+
segment via consumer-side structures (O(k) lookup, prefix-sharing).
105+
**The dedup-by-prefix data structure itself is consumer code, not
106+
a built-in lance-graph crate.** Lance's `versions()` returns
107+
`Vec<lance::dataset::Version>` — the time-axis is the
108+
version-snapshot LOG (each version is a tagged snapshot of the
109+
dataset; the log is append-only, ordered, and queryable). It does
110+
NOT itself identify which identities changed in each snapshot;
111+
adopters who need a change-set derive it by comparing snapshots
112+
(or by maintaining a separate change-index per their workload's
113+
needs). Codex P2 review on PR #454 caught the prior overclaim
114+
that `versions()` was a 'changed-position index'; corrected:
115+
it's the snapshot LOG, and the change-set derivation is consumer
116+
code. An earlier version of this doc cited `vort/vart` as if it
117+
were a shipped crate (a separate fix). The two shipped surfaces
118+
for the bullet are `NiblePath` (identity) and `versions()`
119+
(snapshot log); the radix-shaped consumer trie and the
120+
change-set index are both consumer code.
104121

105122
Concrete example: Wikidata (~115M entities). In Cassandra+JG, the
106123
indexed graph form is multi-TB with replication factor 3 → multi-TB

0 commit comments

Comments
 (0)