Merge pull request #454 from AdaWorldAPI/docs/post-merge-corrections-architecture

AdaWorldAPI · web-flow · commit 2d78b853799a · 2026-06-03T14:41:08.000+02:00
docs: post-merge corrections on PR #452 + #453 architecture docs
diff --git a/docs/APPEND_ONLY_RAFT_DOVETAIL.md b/docs/APPEND_ONLY_RAFT_DOVETAIL.md
@@ -12,6 +12,24 @@ or any other model where storage mutates state in place.
 This doc names the property explicitly + lists the operational
 consequences so adopters can choose the right deployment shape.
 
+> ### Scope: external architecture pattern, not a built-in lance-graph feature
+>
+> Post-merge correction (paralleling peer review on the companion doc
+> PR #453): the deployment shape described below ("peer-Raft +
+> Lance-local-per-node") is an EXTERNAL ARCHITECTURE PATTERN adopters
+> can build on top of lance-graph, NOT a built-in lance-graph
+> capability. Lance-graph provides the append-only columnar storage +
+> the DataFusion query path + the encoding crates; the Raft layer +
+> the substrate binary + the consensus-replication path are
+> downstream consumer code (e.g. `openraft` or `surreal-cluster`).
+> Adopters who only consume lance-graph's columnar + DataFusion path
+> should NOT assume their data is automatically replicated.
+>
+> The doc documents WHY this pattern works well WHEN built on
+> lance-graph — the storage-append/consensus-append dovetail property
+> — not a feature lance-graph itself ships.
+
+
 ## The two write shapes that have to align
 
 A distributed Lance deployment has two write paths:
@@ -35,15 +53,16 @@ Compare this with the conventional alternatives:
 |---|---|---|
 | **LSM-tree (Cassandra)** | Paxos-light / gossip | Storage AND consensus both have their own append-then-mutate cycles. Compaction in storage interacts with hinted handoff in consensus. Coordination headaches. |
 | **B-tree (PostgreSQL)** | 2PC (citus-like) | Storage in-place updates fight with 2PC's append-log. Vacuum interacts with commit-log replay. More headaches. |
-| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. |
+| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. (`DatasetOptimizer.compact_files` produces a new manifest version — that output replicates through the same Raft log as a normal write. The operation runs in one place; the result replicates. Per codex P2 PR #454 — see Operational consequence #1 for the honest framing.) |
 
 ## Operational consequences
 
-### 1. No compaction storms
+### 1. Compaction is qualitatively different, not absent
 
-Cassandra clusters periodically run compaction (rewrite SSTables to
-reclaim space + maintain read performance). Each node compacts on
-its own schedule. During compaction:
+Cassandra clusters periodically run LSM compaction (rewrite SSTables
+to reclaim tombstones + merge sorted runs to maintain read
+performance). Each node compacts on its own schedule. During
+compaction:
 
 - The compacting node's CPU spikes
 - Its disk write bandwidth spikes
@@ -56,15 +75,54 @@ The cluster operator's job is partly to schedule compactions across
 nodes so that not too many compact simultaneously. This is a
 significant operational burden.
 
-Lance has no compaction. The version log IS the truth; old fragments
-can be reclaimed by version-based GC (a much simpler operation than
-SSTable compaction) but the GC is local-only, doesn't interact with
-replication, and doesn't reorder anything.
+Lance has compaction TOO, but of a qualitatively different shape:
+`DatasetOptimizer.compact_files` merges small fragments into larger
+ones to optimize query layout (many small appends produce many small
+fragments which slow scans). For datasets that use deletes, updates,
+or dropped columns, the SAME compaction also performs reclamation —
+deletion vectors get materialized away (removing rows logically
+marked for delete) and dropped columns are physically removed by
+default. So there IS a reclamation role on those datasets; it is
+just NOT the LSM tombstone-reclaim mechanism (Lance has no
+tombstones at the version level — deletions are tracked by
+deletion-vectors against append-only fragments). For append-only
+write workloads with no deletes/updates/drops, the layout-only
+framing applies; for mixed-write workloads, the reclamation
+component is also present. Either way the operation produces new
+append-only fragments at a better layout.
+
+Operationally:
+
+- The compaction OPERATION runs locally on whichever node has the
+  current leader role (or on a node permitted to run a maintenance
+  task in the chosen deployment shape); it does not block normal
+  write flow at the application layer
+- The OUTPUT of compaction — the new manifest version + the new
+  set of fragments — flows through the Raft log like any other
+  Lance commit. Peers see the new manifest version after consensus
+  commits, and anti-entropy converges replicas to the post-compaction
+  state. So the result REPLICATES; the work that produced the result
+  is what runs in one place
+- Per-node SCHEDULING choices do not stack into coordination
+  headaches the way Cassandra LSM scheduling does, because each
+  compaction's product is a single committed version (not a
+  per-replica concurrent rewrite that has to be reconciled). At most
+  one node should run a given compaction at a time to avoid wasted
+  work; this is a coordination choice (lock or leader-only), not a
+  coordination headache
+- The failure modes are smaller: a partial compaction is recoverable
+  via Raft's standard log replay; no in-flight LSM tombstones to
+  lose; correctness is unaffected
 
 A peer-Raft + Lance deployment therefore has uniform per-node
-behavior. Each node is doing the same work at the same time, with
-the same shape. The operations runbook is simpler because the
-failure modes are simpler.
+behavior under consensus. Compaction is a maintenance operation
+that produces a normal commit; operators plan for it (it consumes
+CPU + IO when it runs) but the cluster-wide coordination model is
+simpler than Cassandra's per-node-independent LSM compaction
+scheduling. (Per post-merge correction on PR #452; sharpened per
+codex P2 review on PR #454 — the prior framing said 'independent
+of consensus' which overclaimed; the operation is local but the
+output replicates.)
 
 ### 2. Anti-entropy is a hash compare, not a Merkle-tree walk
 
@@ -109,19 +167,30 @@ footprint depends on which columns mutated, whether the row was new
 or updated, whether the column had a previous value. Cross-DC
 replication budget is harder to plan.
 
-### 5. The consensus tax lands once, not twice
-
-This is the unifying point: with non-append-only storage, an
-application that wants linearizable writes pays the consensus tax
-TWICE. Once for the consensus protocol shipping operations to
-replicas. Once for the storage layer doing per-node compaction +
-mutation bookkeeping. The two taxes interact (a compaction storm
-delays consensus catch-up; a Raft snapshot has to materialize the
-LSM-tree state).
-
-With Lance + Raft, the consensus tax and the storage tax are the
-SAME tax. The append IS both the consensus log entry and the storage
-commit. You pay it once.
+### 5. The consensus tax and the storage-COMMIT tax are the same tax
+
+This is the unifying point: with LSM-tree storage, an application
+that wants linearizable writes pays the consensus tax TWICE. Once
+for the consensus protocol shipping operations to replicas. Once for
+the storage layer doing per-node tombstone-reclaim + run-merge
+compaction. The two taxes interact (a compaction storm delays
+consensus catch-up; a Raft snapshot has to materialize the LSM-tree
+state).
+
+With Lance + Raft, the consensus tax and the storage-COMMIT tax are
+the SAME tax — the append IS both the consensus log entry and the
+storage commit; you pay it once. Lance does have its own file-
+compaction cycle (`DatasetOptimizer.compact_files`), which produces
+a NEW manifest version — and that new version flows through the
+same Raft log as any other write, replicating to peers via the
+normal consensus + anti-entropy path. So compaction's OUTPUT
+counts as a consensus event (one more append). What it does NOT
+add is a SECOND tax of the LSM-tree kind (per-node tombstone-
+reclaim + run-merge bookkeeping that runs on every replica
+independently and creates coordination headaches with replication).
+The LAYOUT-OPTIMIZATION cycle exists; it pays the SAME consensus
+tax as a regular write (one commit), and does NOT layer a separate
+per-replica storage tax on top.
 
 ## What this implies for deployment shape
 
diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md
@@ -95,12 +95,29 @@ by 1-3 orders of magnitude vs LSM-tree wide-column representations:
   reduces ANN scan space by ~16× under empirical intra-family
   locality (98.6% per the `lance-graph` PR #444 probe).
 
-- **vort/vart adaptive radix trie**: structural deduplication of
-  shared HHTL prefixes. The heel + hip nibbles common across many
-  entities are stored ONCE per path segment, not N times. Adaptive
-  Radix Tree shape: O(k) lookup, prefix-sharing storage. The same
-  structure also serves as the time-axis index over Lance
-  `versions()` for cold-path queries.
+- **`lance-graph-contract::hhtl::NiblePath` (shipped) + Lance
+  `versions()` (shipped)**: HHTL identity is a 16ⁿ nibble path packed
+  into a `u64` (`FAN_OUT = 16`, `MAX_DEPTH = 16`). Adopters who want
+  to dedupe shared HHTL prefixes in memory typically derive an
+  adaptive radix-trie index over `NiblePath` addresses — heel + hip
+  nibbles common across many entities can be stored once per path
+  segment via consumer-side structures (O(k) lookup, prefix-sharing).
+  **The dedup-by-prefix data structure itself is consumer code, not
+  a built-in lance-graph crate.** Lance's `versions()` returns
+  `Vec<lance::dataset::Version>` — the time-axis is the
+  version-snapshot LOG (each version is a tagged snapshot of the
+  dataset; the log is append-only, ordered, and queryable). It does
+  NOT itself identify which identities changed in each snapshot;
+  adopters who need a change-set derive it by comparing snapshots
+  (or by maintaining a separate change-index per their workload's
+  needs). Codex P2 review on PR #454 caught the prior overclaim
+  that `versions()` was a 'changed-position index'; corrected:
+  it's the snapshot LOG, and the change-set derivation is consumer
+  code. An earlier version of this doc cited `vort/vart` as if it
+  were a shipped crate (a separate fix). The two shipped surfaces
+  for the bullet are `NiblePath` (identity) and `versions()`
+  (snapshot log); the radix-shaped consumer trie and the
+  change-set index are both consumer code.
 
 Concrete example: Wikidata (~115M entities). In Cassandra+JG, the
 indexed graph form is multi-TB with replication factor 3 → multi-TB