Commit d854974

docs: post-merge corrections on PR #452 + #453 architecture docs
Two follow-up doc fixes after PR #452 and #453 merged. Reviewers flagged real bugs the doc-only PRs landed with; these corrections bring main in line with the post-review state.

== CLUSTER_ASYMMETRY.md ==

Drop the vort/vart adaptive-radix-trie citation; cite shipped NiblePath + Lance versions() instead. A workspace scan across all Cargo.toml files confirms vart is NOT a dependency in this repository; le-domino's own seam-map documents it as 'doc-prose only'. Citing a non-existent crate in a public architecture doc misleads adopters into looking for a crate that they cannot find.

The role the bullet described (HHTL-prefix dedup) is actually:

- lance-graph-contract::hhtl::NiblePath (shipped) for the identity primitive (16-fan-out nibble path packed into a u64)
- Lance versions() (shipped) for the time-axis (cross-session index of which identity positions changed when)

Adopters can derive an adaptive radix-trie index over NiblePath addresses themselves; that data structure is consumer code, not a lance-graph dep. The corrected bullet cites the two shipped surfaces and flags the radix-shaped consumer pattern as proposed.

== APPEND_ONLY_RAFT_DOVETAIL.md ==

Apply the same critique class codex raised on PR #453's companion doc:

- Scope caveat: peer-Raft + Lance-local is an EXTERNAL architecture pattern (bardioc B1 substrate-b), NOT a built-in lance-graph feature. Adopters provide the Raft layer themselves (openraft / surreal-cluster / external TiKV). Lance-graph contributes the storage-append/consensus-append dovetail property that MAKES the pattern cheap; not the pattern itself. Added a scope banner immediately after the TL;DR.
- Compaction honesty: Lance has compaction TOO via DatasetOptimizer.compact_files for fragment layout. The doc previously said 'Lance has no compaction' which is wrong for append-heavy deployments. Rewrote section 1 to distinguish LSM tombstone-reclaim+run-merge compaction from Lance file-layout compaction. Both exist; only LSM coordinates with replication.
- Consensus-tax-lands-once section also updated to acknowledge that Lance file compaction runs INDEPENDENTLY of consensus (the storage commit tax and the consensus tax are the same write; the LAYOUT OPTIMIZATION cycle is a separate concern that does not couple).

Both files were merged earlier (PR #452, #453); these corrections land in main as a follow-up so adopters reading the docs today see the honest scope + correct citations. No code changes; doc-only.
1 parent 16f879b commit d854974

2 files changed

Lines changed: 81 additions & 31 deletions

docs/APPEND_ONLY_RAFT_DOVETAIL.md

Lines changed: 66 additions & 25 deletions
@@ -12,6 +12,24 @@ or any other model where storage mutates state in place.
 This doc names the property explicitly + lists the operational
 consequences so adopters can choose the right deployment shape.

+> ### Scope: external architecture pattern, not a built-in lance-graph feature
+>
+> Post-merge correction (paralleling peer review on the companion doc
+> PR #453): the deployment shape described below ("peer-Raft +
+> Lance-local-per-node") is an EXTERNAL ARCHITECTURE PATTERN adopters
+> can build on top of lance-graph, NOT a built-in lance-graph
+> capability. Lance-graph provides the append-only columnar storage +
+> the DataFusion query path + the encoding crates; the Raft layer +
+> the substrate binary + the consensus-replication path are
+> downstream consumer code (e.g. `openraft` or `surreal-cluster`).
+> Adopters who only consume lance-graph's columnar + DataFusion path
+> should NOT assume their data is automatically replicated.
+>
+> The doc documents WHY this pattern works well WHEN built on
+> lance-graph: the storage-append/consensus-append dovetail property,
+> not a feature lance-graph itself ships.
+
+
 ## The two write shapes that have to align

 A distributed Lance deployment has two write paths:
@@ -35,15 +53,16 @@ Compare this with the conventional alternatives:
 |---|---|---|
 | **LSM-tree (Cassandra)** | Paxos-light / gossip | Storage AND consensus both have their own append-then-mutate cycles. Compaction in storage interacts with hinted handoff in consensus. Coordination headaches. |
 | **B-tree (PostgreSQL)** | 2PC (citus-like) | Storage in-place updates fight with 2PC's append-log. Vacuum interacts with commit-log replay. More headaches. |
-| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. |
+| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. (Lance does have its own file-layout compaction via `DatasetOptimizer.compact_files`, but it runs INDEPENDENTLY of consensus; see Operational consequence #1.) |

 ## Operational consequences

-### 1. No compaction storms
+### 1. Compaction is qualitatively different, not absent

-Cassandra clusters periodically run compaction (rewrite SSTables to
-reclaim space + maintain read performance). Each node compacts on
-its own schedule. During compaction:
+Cassandra clusters periodically run LSM compaction (rewrite SSTables
+to reclaim tombstones + merge sorted runs to maintain read
+performance). Each node compacts on its own schedule. During
+compaction:

 - The compacting node's CPU spikes
 - Its disk write bandwidth spikes
@@ -56,15 +75,31 @@ The cluster operator's job is partly to schedule compactions across
 nodes so that not too many compact simultaneously. This is a
 significant operational burden.

-Lance has no compaction. The version log IS the truth; old fragments
-can be reclaimed by version-based GC (a much simpler operation than
-SSTable compaction) but the GC is local-only, doesn't interact with
-replication, and doesn't reorder anything.
+Lance has compaction TOO, but of a qualitatively different shape:
+`DatasetOptimizer.compact_files` merges small fragments into larger
+ones to optimize query layout (many small appends produce many small
+fragments which slow scans). It is NOT a tombstone-reclaim cycle:
+Lance is append-only at the version level, so there are no tombstones
+to reclaim in the LSM sense. File compaction takes existing
+append-only fragments and produces new append-only fragments at a
+better layout.
+
+Operationally:
+
+- Compaction runs INDEPENDENTLY of consensus replication: file
+compaction does not block writes, does not affect Raft log shipping,
+does not coordinate across nodes
+- Per-node compaction is local-only; each node compacts its own
+Lance dataset on its own schedule without affecting peers
+- The failure modes are smaller (a partial compaction is recoverable;
+no in-flight tombstones to lose; correctness is unaffected)

 A peer-Raft + Lance deployment therefore has uniform per-node
-behavior. Each node is doing the same work at the same time, with
-the same shape. The operations runbook is simpler because the
-failure modes are simpler.
+behavior under both consensus and compaction. Operators still plan
+for file compaction (it consumes CPU + IO when it runs), but the
+cluster-wide coordination burden is significantly lower than
+Cassandra's LSM compaction scheduling. (Per post-merge correction
+on PR #452.)

 ### 2. Anti-entropy is a hash compare, not a Merkle-tree walk

@@ -109,19 +144,25 @@ footprint depends on which columns mutated, whether the row was new
 or updated, whether the column had a previous value. Cross-DC
 replication budget is harder to plan.

-### 5. The consensus tax lands once, not twice
-
-This is the unifying point: with non-append-only storage, an
-application that wants linearizable writes pays the consensus tax
-TWICE. Once for the consensus protocol shipping operations to
-replicas. Once for the storage layer doing per-node compaction +
-mutation bookkeeping. The two taxes interact (a compaction storm
-delays consensus catch-up; a Raft snapshot has to materialize the
-LSM-tree state).
-
-With Lance + Raft, the consensus tax and the storage tax are the
-SAME tax. The append IS both the consensus log entry and the storage
-commit. You pay it once.
+### 5. The consensus tax and the storage-COMMIT tax are the same tax
+
+This is the unifying point: with LSM-tree storage, an application
+that wants linearizable writes pays the consensus tax TWICE. Once
+for the consensus protocol shipping operations to replicas. Once for
+the storage layer doing per-node tombstone-reclaim + run-merge
+compaction. The two taxes interact (a compaction storm delays
+consensus catch-up; a Raft snapshot has to materialize the LSM-tree
+state).
+
+With Lance + Raft, the consensus tax and the storage-COMMIT tax are
+the SAME tax: the append IS both the consensus log entry and the
+storage commit; you pay it once. Lance does have its own file-
+compaction cycle (`DatasetOptimizer.compact_files`), but it runs
+INDEPENDENTLY of consensus: file compaction takes existing
+append-only fragments and rewrites them into bigger append-only
+fragments without tombstones to reclaim and without coordinating
+with replication. So the LAYOUT-OPTIMIZATION cycle exists but does
+not interact with the consensus tax.

 ## What this implies for deployment shape

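The corrected sections 1 and 5 in the hunks above claim that the append is both the consensus log entry and the storage commit, while file compaction is a separate, node-local layout rewrite. A minimal Rust sketch of that claim, under stated assumptions: `LogEntry`, `Fragment`, and `LocalDataset` are hypothetical toy types invented for illustration. They are not lance-graph, Lance, or `openraft` APIs, and real Lance compaction does not look like the `compact` method here.

```rust
// Hypothetical toy model; none of these types exist in lance-graph or Lance.

/// One committed consensus log entry carrying a batch of rows to append.
#[derive(Clone, Debug, PartialEq)]
struct LogEntry {
    index: u64,
    rows: Vec<String>,
}

/// An immutable fragment: written once, never mutated in place.
#[derive(Clone, Debug, PartialEq)]
struct Fragment {
    rows: Vec<String>,
}

/// Toy append-only dataset held by a single node.
#[derive(Default)]
struct LocalDataset {
    fragments: Vec<Fragment>,
    /// Index of the last consensus entry applied (0 = nothing applied yet).
    last_applied: u64,
}

impl LocalDataset {
    /// Applying a committed entry IS the storage commit: one append.
    fn apply(&mut self, entry: &LogEntry) -> Result<(), String> {
        let expected = self.last_applied + 1;
        if entry.index != expected {
            return Err(format!("log gap: expected {expected}, got {}", entry.index));
        }
        self.fragments.push(Fragment { rows: entry.rows.clone() });
        self.last_applied = entry.index;
        Ok(())
    }

    /// Layout-only compaction: merge all fragments into one new fragment.
    /// It consumes no log entry and leaves `last_applied` untouched.
    fn compact(&mut self) {
        let merged: Vec<String> = self
            .fragments
            .iter()
            .flat_map(|f| f.rows.iter().cloned())
            .collect();
        self.fragments = vec![Fragment { rows: merged }];
    }

    /// Logical contents in append order, independent of fragment layout.
    fn scan(&self) -> Vec<String> {
        self.fragments
            .iter()
            .flat_map(|f| f.rows.iter().cloned())
            .collect()
    }
}

fn main() {
    let log = vec![
        LogEntry { index: 1, rows: vec!["a".into(), "b".into()] },
        LogEntry { index: 2, rows: vec!["c".into()] },
        LogEntry { index: 3, rows: vec!["d".into()] },
    ];

    let mut node_a = LocalDataset::default();
    let mut node_b = LocalDataset::default();
    for entry in &log {
        node_a.apply(entry).unwrap();
        node_b.apply(entry).unwrap();
    }

    // Only node A compacts, on its own schedule, without telling node B.
    node_a.compact();

    // Physical layout diverges, logical state and log position do not.
    assert_ne!(node_a.fragments.len(), node_b.fragments.len());
    assert_eq!(node_a.scan(), node_b.scan());
    assert_eq!(node_a.last_applied, node_b.last_applied);
    println!("layouts differ, contents and log position agree");
}
```

In this toy model the only state peers have to agree on is the applied log position and the logical contents; fragment count is a per-node detail, which is the sense in which the layout cycle does not couple to consensus.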
docs/CLUSTER_ASYMMETRY.md

Lines changed: 15 additions & 6 deletions
@@ -95,12 +95,21 @@ by 1-3 orders of magnitude vs LSM-tree wide-column representations:
 reduces ANN scan space by ~16× under empirical intra-family
 locality (98.6% per the `lance-graph` PR #444 probe).

-- **vort/vart adaptive radix trie**: structural deduplication of
-shared HHTL prefixes. The heel + hip nibbles common across many
-entities are stored ONCE per path segment, not N times. Adaptive
-Radix Tree shape: O(k) lookup, prefix-sharing storage. The same
-structure also serves as the time-axis index over Lance
-`versions()` for cold-path queries.
+- **`lance-graph-contract::hhtl::NiblePath` (shipped) + Lance
+`versions()` (shipped)**: HHTL identity is a 16ⁿ nibble path packed
+into a `u64` (`FAN_OUT = 16`, `MAX_DEPTH = 16`). Adopters who want
+to dedupe shared HHTL prefixes in memory typically derive an
+adaptive radix-trie index over `NiblePath` addresses; heel + hip
+nibbles common across many entities can be stored once per path
+segment via consumer-side structures (O(k) lookup, prefix-sharing).
+**The dedup-by-prefix data structure itself is consumer code, not
+a built-in lance-graph crate.** Lance's own `versions()` log is the
+time-axis (cross-session index of which identity positions changed
+when). An earlier version of this doc cited `vort/vart` as if it
+were a shipped crate; corrected per peer review: the radix-shaped
+trie at the cognitive layer is a proposed pattern (no shipped crate
+name) and the identity primitive + the time-axis are the two
+shipped surfaces this bullet should have cited from the start.

 Concrete example: Wikidata (~115M entities). In Cassandra+JG, the
 indexed graph form is multi-TB with replication factor 3 → multi-TB

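The corrected bullet above describes HHTL identity as a nibble path packed into a `u64` (`FAN_OUT = 16`, `MAX_DEPTH = 16`) and leaves prefix dedup to consumer code. A minimal Rust sketch of what such consumer code might look like, under stated assumptions: `PackedPath` is a hypothetical stand-in, not the shipped `NiblePath` type. Its bit layout (nibble 0 in the top 4 bits, depth kept in a separate field) is an assumption, and the first two nibbles stand in for heel + hip. A sorted map keyed by prefix is used here in place of a true adaptive radix trie.

```rust
use std::collections::BTreeMap;

// Constants taken from the bullet; everything else below is hypothetical.
const FAN_OUT: u64 = 16;
const MAX_DEPTH: usize = 16;
const BITS_PER_NIBBLE: u32 = 4;

/// Hypothetical stand-in for a packed nibble path (NOT the shipped type).
/// Assumed layout: nibble 0 occupies the most significant 4 bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct PackedPath {
    bits: u64,
    depth: u8,
}

impl PackedPath {
    /// Pack up to MAX_DEPTH nibbles; rejects values outside 0..FAN_OUT.
    fn from_nibbles(nibbles: &[u8]) -> Option<Self> {
        if nibbles.len() > MAX_DEPTH {
            return None;
        }
        let mut bits = 0u64;
        for (i, &n) in nibbles.iter().enumerate() {
            if u64::from(n) >= FAN_OUT {
                return None;
            }
            let shift = 60 - (i as u32) * BITS_PER_NIBBLE;
            bits |= u64::from(n) << shift;
        }
        Some(Self { bits, depth: nibbles.len() as u8 })
    }

    /// Read back nibble `i`, or None past the path's depth.
    fn nibble(&self, i: usize) -> Option<u8> {
        if i >= self.depth as usize {
            return None;
        }
        let shift = 60 - (i as u32) * BITS_PER_NIBBLE;
        Some(((self.bits >> shift) & 0xF) as u8)
    }

    /// Number of leading nibbles two paths share (one XOR + leading_zeros).
    fn shared_prefix_len(&self, other: &Self) -> usize {
        let limit = self.depth.min(other.depth) as usize;
        let common = ((self.bits ^ other.bits).leading_zeros() / BITS_PER_NIBBLE) as usize;
        common.min(limit)
    }

    /// Truncate to the first `len` nibbles.
    fn prefix(&self, len: usize) -> Self {
        let len = len.min(self.depth as usize);
        let bits = if len == 0 {
            0
        } else {
            self.bits & (u64::MAX << (64 - BITS_PER_NIBBLE * len as u32))
        };
        Self { bits, depth: len as u8 }
    }
}

fn main() {
    let a = PackedPath::from_nibbles(&[0x3, 0xA, 0x1]).unwrap();
    let b = PackedPath::from_nibbles(&[0x3, 0xA, 0x7]).unwrap();
    let c = PackedPath::from_nibbles(&[0x3, 0xB, 0x1]).unwrap();

    assert_eq!(a.nibble(1), Some(0xA));
    assert_eq!(a.shared_prefix_len(&b), 2);
    assert_eq!(a.shared_prefix_len(&c), 1);

    // Consumer-side dedup: store each 2-nibble prefix once, hang entities off it.
    let mut by_prefix: BTreeMap<PackedPath, Vec<PackedPath>> = BTreeMap::new();
    for p in [a, b, c] {
        by_prefix.entry(p.prefix(2)).or_default().push(p);
    }
    assert_eq!(by_prefix.len(), 2);
    assert_eq!(by_prefix[&a.prefix(2)].len(), 2);
    println!("{} distinct 2-nibble prefixes", by_prefix.len());
}
```

Because nibbles are packed most-significant-first in this sketch, integer ordering matches path ordering, so a sorted map already groups shared prefixes contiguously; a real adaptive radix trie would refine this per path segment.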