lance-graph-contract: UniCharSet content store — the byte-parity probe's Rust side

claude · claude · commit 5b25aad7449a · 2026-06-17T20:17:44.000Z
The deferred Option A content-store tier, built (operator's "keep building here" + the leptonica-is-an-install-not-a-transcode epiphany).

New unicharset module: UniCharSet (deepnsm::Vocabulary-shaped: reverse id->unichar + lookup unichar->id), load_from_str/load_from_file parsing the .unicharset text format (line 1 = count; first whitespace token per line = unichar; id = position; property columns ignored — the old_style_included_ plain-table scope), id_to_unichar/unichar_to_id (the two adapter leaves), and dump() rendering the <id>\t<unichar> table matching the C++ oracle.

This is the Rust side of PROBE-OGAR-ADAPTER-UNICHARSET. The unicharset path is pure text parsing — ZERO leptonica (never touches Pix) — so it builds and unit-tests in-env with no C deps. leptonica is only an *install* (a link dep of the C++ oracle harness), never a transcode and never in the Rust path.

Byte parity is now one `diff`: combine_tessdata to get eng.unicharset, a ~10-line libtesseract harness dumps id_to_unichar, `cargo run --example unicharset_dump` dumps the Rust side, diff. Byte-identical => CONJECTURE -> FINDING.

Additive (sibling content-store module, zero NodeRow/tenant impact). Board: LATEST_STATE Contract Inventory (D-UNICHARSET-1). +4 tests + the unicharset_dump example; 644 contract lib green; clippy -D warnings + fmt clean. The classid->&UniCharSet LazyLock resolver (OGAR wiring) remains the follow-up.

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
diff --git a/.claude/board/LATEST_STATE.md b/.claude/board/LATEST_STATE.md
@@ -82,6 +82,8 @@
 
 ## Current Contract Inventory (lance-graph-contract)
 
+> **2026-06-17 — ADDED (D-UNICHARSET-1, byte-parity probe Rust side)**: `lance_graph_contract::unicharset::{UniCharSet, UniCharSetError}` — the Tesseract `UNICHARSET` content-store tier (the Core-First doctrine's variable-length classid-keyed registry, `deepnsm::Vocabulary`-shaped: `reverse: Vec<String>` id→unichar + `lookup: HashMap<String,u32>` unichar→id). `load_from_str`/`load_from_file` parse the `.unicharset` text format (line 1 = count, then the first whitespace token per line = unichar, id = position; property columns ignored — the `old_style_included_` plain-table scope); `id_to_unichar`/`unichar_to_id` are the two adapter leaves; `dump()` renders the `<id>\t<unichar>` table matching the C++ oracle. **The Rust side of `PROBE-OGAR-ADAPTER-UNICHARSET`** — pure text parsing, ZERO leptonica (the unicharset path never touches `Pix`), so it builds + unit-tests in-env; byte-parity is one `diff` against a libtesseract oracle harness on a leptonica-installed box (steps in `examples/unicharset_dump.rs`). Additive (a sibling content-store module, zero `NodeRow`/tenant impact). +4 tests (format parse, bijection round-trip, oracle-shape dump, typed errors) + the `unicharset_dump` example; 644 contract lib green; clippy `-D warnings` clean. Plan: `transcode-extend-core-probe-v1.md` (the deferred Option A content-store tier, now built for the probe). The classid→`&UniCharSet` `LazyLock` resolver remains the wiring follow-up.
+
 > **2026-06-17 — ADDED (D-CPP-CODEGEN-1, C-FIRST step 2 compile target)**: `lance_graph_contract::codegen_manifest::{MethodSig, ClassMethods, methods_for}` — the Core-side target of the C++ method-resolution manifest emitted by `ruff_cpp_codegen` (the Tesseract AST-DLL pipeline's stage 2). `MethodSig` is the dispatch-relevant signature in a **`const`-constructible** shape (all fields `&'static`: `name`, `params: &'static [&'static str]`, `ret`, `is_const`, `is_static`, `overrides`) — the method-axis sibling of `class_view::ClassView`'s field projection, deliberately NOT `String`-backed (a generated `const X: &[MethodSig] = &[MethodSig { .. }]` must compile; `FieldRef` is `String`-backed and cannot). `ClassMethods{classid, methods}` is the registry ENTRY the generated code emits (classid bound OGAR-side, never minted here); `methods_for(registry, classid) -> &'static [MethodSig]` is the pure lookup with zero-fallback (unregistered classid → empty slice). **Additive** (container-architect ADDITIVE-CONFIRMED): a sibling module, zero `NodeRow`/`ValueTenant`/`ValueSchema`/stride/`ENVELOPE_LAYOUT_VERSION` impact; the runtime `classid→methods` registry DATA lives downstream (generated in the consumer repo), not here. Body-shaping flags (pure-virtual/constexpr/noexcept/operator/requires) are out of scope (they drive body generation, not the signature manifest). The 8-agent step-2 council's deferred-runtime-registry resolution. +2 tests (const-constructibility proof + zero-fallback lookup); 640 contract lib green; clippy `-D warnings` clean. Plan: `.claude/plans/transcode-extend-core-probe-v1.md` (C step 2). Consumer: `ruff_cpp_codegen::render` (AdaWorldAPI/ruff) names this type in emit-text-only output.
 
 > **2026-06-16 — ADDED (4-task unblock-cascade)**: `lance_graph_contract::hhtl::NiblePath::{from_guid_prefix(&NodeGuid) -> Option<NiblePath>, prefix(depth: u8) -> Option<NiblePath>}` — the ontology-side keystone follow-up of #498's `classid → ReadMode` LE contract. The 20-nibble `classid · HEEL · HIP · TWIG` prefix is deterministically folded to 16 (the canon-reserved high `u16` of classid drops); returns `None` when the fold would be lossy (callers don't get silent collisions). `prefix(d)` is the O(1) single-shot ancestor view that satisfies `prefix(d).is_ancestor_of(self)` for every `d ≤ self.depth` — the routing-cache view of a deeper class path. **One layer up** in `cognitive-shader-driver::MailboxSoA<N>`: `impl MailboxSoaView + MailboxSoaOwner` (cherry-pick of `jolly-cori-clnf9::463d71b`) + the `pub phase: KanbanColumn` field — the in-RAM Rubicon owner the contract's `MailboxSoaOwner` had no real implementor for (integrated-cognitive-planner-v1 §2 Seam #3 closed). In `lance_graph::graph::scheduler`: `LanceVersionScheduler<S = NextPhaseScheduler>` — D-MBX-9-IN core impl over `VersionedGraph::versions()`, generic over the inner `VersionScheduler` policy (closes `E-SUBSTRATE-IS-THE-SCHEDULER`'s OUT-direction). In `surreal_container::view`: `SurrealMailboxView<'a>` + `read_via_kv_lance()` (D-PG-6 contract slice) — the SurrealQL read-glove the integrator wires once the cold-build of the surrealdb fork is taken; the contract surface is available today. Plus `SurrealContainerError::BlockedColdBuild` — typed signal for callers to pattern-match the cold-build gate (distinct from the pre-existing `Blocked` variant which signals coordinate/API gaps). Zero-dep contract additions (+7 hhtl tests, 632 lib green); cognitive-shader-driver +1 driving-loop test (86 lib green); lance-graph::scheduler new module (+5 tests, real tempdir Lance); surreal_container::view new module (+4 tests). All four green; clippy `-D warnings` clean on the new files. EPIPHANIES `E-UNBLOCK-CASCADE-1` records the convergence of three independent landings onto the single `MailboxSoaView` trait surface.
diff --git a/.claude/plans/transcode-extend-core-probe-v1.md b/.claude/plans/transcode-extend-core-probe-v1.md
@@ -435,3 +435,26 @@ the generated crate into tesseract-rs (needs the leptonica build env) and run
 `PROBE-OGAR-ADAPTER-UNICHARSET` byte-parity (Option B) — the only path to
 CONJECTURE→FINDING. Everything to here is CONJECTURE per the doctrine; the
 `PARITY: UNRUN` markers on every generated file say so.
+
+### Unicharset content store + byte-parity probe — Rust side READY (2026-06-17)
+
+The deferred Option A content-store tier, built (the operator chose "keep
+building here" + the leptonica-is-an-install-not-a-transcode epiphany):
+`lance_graph_contract::unicharset::UniCharSet` — `deepnsm::Vocabulary`-shaped
+(`reverse`/`lookup`), `load_from_str`/`load_from_file` + `id_to_unichar` /
+`unichar_to_id` + `dump()`. **Pure text parsing, ZERO leptonica** (the unicharset
+path never touches `Pix`), so it builds + unit-tests in-env (4 tests; 644 lib
+green). The `unicharset_dump` example renders the oracle-shape table.
+
+**The byte-parity probe is now one diff, not a build.** leptonica is an *install*
+(`apt-get install libtesseract-dev libleptonica-dev`), never a transcode — it's only a
+*link* dep of the C++ oracle harness, never in the Rust path. So
+`PROBE-OGAR-ADAPTER-UNICHARSET` reduces to: (1) `combine_tessdata -u … eng.` to
+get a real `.unicharset`; (2) a ~10-line C++ harness (`-ltesseract -lleptonica`)
+dumps `id_to_unichar`; (3) `cargo run --example unicharset_dump` dumps the Rust
+side; (4) `diff`. Byte-identical → CONJECTURE→FINDING. Built to the documented
+`old_style_included_` plain-table format; any special-token edge case the real
+`eng.unicharset` shows on first diff is the refine-then item (the falsifier
+loop). The classid→`&UniCharSet` `LazyLock` resolver (the OGAR wiring) and
+transcoding leptonica itself (only for the far-off zero-C end-state, a large
+hand-port — NOT needed for the probe) both remain explicit follow-ups.
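
Editor's sketch of the resolver follow-up named above, assuming only what the plan states: a classid keys a shared, lazily built `UniCharSet`. The `u32` classid type, `DEMO_CLASSID`, `DEMO_TABLE`, `REGISTRY`, and `resolve` are hypothetical names introduced for illustration, and the struct is a reduced stand-in for the contract type so the block compiles on its own. It is not the OGAR wiring, which remains unwritten.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

/// Stand-in for `lance_graph_contract::unicharset::UniCharSet`, reduced to the
/// id -> unichar table so this sketch is self-contained.
struct UniCharSet {
    reverse: Vec<String>,
}

impl UniCharSet {
    /// Same plain-table rule as the plan: line 1 = count, then the first
    /// whitespace token of each entry line is the unichar.
    fn load_from_str(text: &str) -> Option<Self> {
        let mut lines = text.lines();
        let count: usize = lines.next()?.trim().parse().ok()?;
        let reverse: Vec<String> = lines
            .take(count)
            .map(|line| line.split_whitespace().next().unwrap_or("").to_string())
            .collect();
        (reverse.len() == count).then_some(Self { reverse })
    }

    fn id_to_unichar(&self, id: u32) -> Option<&str> {
        self.reverse.get(id as usize).map(String::as_str)
    }
}

/// HYPOTHETICAL classid; real classids are bound OGAR-side, never minted here.
const DEMO_CLASSID: u32 = 0x0001;

/// HYPOTHETICAL embedded table; a real resolver would read a `.unicharset` file.
const DEMO_TABLE: &str = "2\na 3\nb 3\n";

/// Built once, on first access, then shared for the life of the process.
static REGISTRY: LazyLock<HashMap<u32, UniCharSet>> = LazyLock::new(|| {
    let mut map = HashMap::new();
    if let Some(set) = UniCharSet::load_from_str(DEMO_TABLE) {
        map.insert(DEMO_CLASSID, set);
    }
    map
});

/// classid -> `&'static UniCharSet`; an unregistered classid resolves to `None`.
fn resolve(classid: u32) -> Option<&'static UniCharSet> {
    REGISTRY.get(&classid)
}

fn main() {
    let set = resolve(DEMO_CLASSID).expect("demo classid is registered");
    println!("{:?}", set.id_to_unichar(1)); // prints Some("b")
    println!("{}", resolve(0xFFFF).is_none()); // prints true
}
```

Returning `Option` rather than a default set mirrors the zero-fallback lookup that `methods_for` uses for unregistered classids; whether the real resolver should behave the same way is an open design choice. `std::sync::LazyLock` needs Rust 1.80 or newer.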
diff --git a/crates/lance-graph-contract/examples/unicharset_dump.rs b/crates/lance-graph-contract/examples/unicharset_dump.rs
@@ -0,0 +1,39 @@
+//! Dump a `.unicharset`'s id→unichar table — the Rust side of the byte-parity
+//! probe `PROBE-OGAR-ADAPTER-UNICHARSET`.
+//!
+//! ```sh
+//! # on a box with libtesseract + libleptonica installed:
+//! combine_tessdata -u $(dpkg -L tesseract-ocr-eng | grep eng.traineddata) /tmp/eng.
+//! # C++ oracle (links -lleptonica only to satisfy the linker; never calls it):
+//! #   g++ oracle.cpp -ltesseract -lleptonica -o oracle && ./oracle /tmp/eng.unicharset > /tmp/oracle.tsv
+//! # Rust side:
+//! cargo run -p lance-graph-contract --example unicharset_dump -- /tmp/eng.unicharset > /tmp/rust.tsv
+//! diff /tmp/oracle.tsv /tmp/rust.tsv   # byte-identical => CONJECTURE -> FINDING
+//! ```
+
+#![allow(
+    clippy::print_stdout,
+    reason = "a dump CLI example writes to stdout by design"
+)]
+
+use std::path::Path;
+use std::process::ExitCode;
+
+use lance_graph_contract::unicharset::UniCharSet;
+
+fn main() -> ExitCode {
+    let Some(path) = std::env::args().nth(1) else {
+        eprintln!("usage: unicharset_dump <path/to/eng.unicharset>");
+        return ExitCode::FAILURE;
+    };
+    match UniCharSet::load_from_file(Path::new(&path)) {
+        Ok(unicharset) => {
+            print!("{}", unicharset.dump());
+            ExitCode::SUCCESS
+        }
+        Err(err) => {
+            eprintln!("error: {err}");
+            ExitCode::FAILURE
+        }
+    }
+}
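
Editor's sketch for the final `diff` step in the example's doc comment: when the two dumps are not byte-identical, the useful fact is where they first part ways. The `Divergence` type and `first_divergence` function are hypothetical helpers, not part of this commit, and plain `diff` remains the documented probe. The comparison is done on raw bytes so that a missing trailing newline or a non-UTF-8 byte counts as a mismatch.

```rust
/// Where two dumps first differ. HYPOTHETICAL helper type, not in the commit.
#[derive(Debug, PartialEq, Eq)]
enum Divergence {
    /// Both dumps have this 0-based line, but its bytes differ.
    Line(usize),
    /// One dump ran out first; the value is how many lines matched before that.
    Length(usize),
}

/// Compare two dumps byte-for-byte, line by line. `None` means byte-identical.
fn first_divergence(oracle: &[u8], rust: &[u8]) -> Option<Divergence> {
    if oracle == rust {
        return None;
    }
    // `split_inclusive` keeps each line's trailing `\n`, so a missing final
    // newline shows up as a differing line rather than being swallowed.
    let mut left = oracle.split_inclusive(|&byte| byte == b'\n');
    let mut right = rust.split_inclusive(|&byte| byte == b'\n');
    let mut line = 0;
    loop {
        match (left.next(), right.next()) {
            (Some(x), Some(y)) if x == y => line += 1,
            (Some(_), Some(_)) => return Some(Divergence::Line(line)),
            // The inputs are unequal, so exactly one side ended early here.
            _ => return Some(Divergence::Length(line)),
        }
    }
}

fn main() {
    let oracle = b"0\ta\n1\tb\n2\tcd\n";
    let rust = b"0\ta\n1\tB\n2\tcd\n";
    println!("{:?}", first_divergence(oracle, oracle)); // prints None
    println!("{:?}", first_divergence(oracle, rust)); // prints Some(Line(1))
}
```

Because the dump's line index is the unichar id, a `Line(n)` result points directly at the id whose unichar disagrees with the oracle.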
diff --git a/crates/lance-graph-contract/src/lib.rs b/crates/lance-graph-contract/src/lib.rs
@@ -101,6 +101,7 @@ pub mod soa_view;
 pub mod splat;
 pub mod tax;
 pub mod thinking;
+pub mod unicharset;
 pub mod view_angle;
 pub mod vsa;
 pub mod witness_table;
diff --git a/crates/lance-graph-contract/src/unicharset.rs b/crates/lance-graph-contract/src/unicharset.rs
@@ -0,0 +1,219 @@
+//! `UNICHARSET` content store — the Rust side of the byte-parity probe
+//! (`PROBE-OGAR-ADAPTER-UNICHARSET`).
+//!
+//! Tesseract's `UNICHARSET` is a variable-length id↔unichar bijection loaded
+//! from a `.unicharset` text file. Per the Core-First doctrine it is NOT
+//! fixed-width per-node state — it rides a **classid-keyed content-store tier**
+//! shaped exactly like `deepnsm::Vocabulary`: a `reverse: Vec<String>`
+//! (id → unichar) plus a `lookup: HashMap<String, u32>` (unichar → id). This
+//! module is that tier plus the two adapter leaves (`id_to_unichar` /
+//! `unichar_to_id`).
+//!
+//! # Why this is the byte-parity surface
+//!
+//! The unicharset path is pure text parsing — it never touches leptonica or
+//! `Pix`. So the Rust side can be built and tested with **zero C dependencies**.
+//! The probe compares this implementation's [`UniCharSet::dump`] of a real
+//! `eng.unicharset` against the C++ `UNICHARSET::id_to_unichar` oracle (a small
+//! libtesseract harness, which only *links* leptonica, never calls it). Byte-
+//! identical dumps promote the doctrine CONJECTURE → FINDING.
+//!
+//! # Format scope
+//!
+//! The `.unicharset` format is: line 1 = entry count `N`; then `N` lines, each
+//! beginning with the unichar as its first whitespace-delimited token (the
+//! remaining columns — properties / script / bounding boxes — do not affect the
+//! id↔unichar bijection and are ignored). The line position (0-based, after the
+//! count line) IS the unichar id. This is the `old_style_included_ == true`
+//! plain-table scope the adapter-shaper bounded; fragment/`CleanupString`
+//! normalization is a separate, later leaf. Any special-token edge case a real
+//! `eng.unicharset` reveals on first diff is refined then — this is built to the
+//! documented format, diff-pending.
+
+use std::collections::HashMap;
+use std::path::Path;
+
+/// A loaded `UNICHARSET`: the id↔unichar bijection, `deepnsm::Vocabulary`-shaped.
+#[derive(Debug, Clone, Default, PartialEq, Eq)]
+pub struct UniCharSet {
+    /// id → unichar (index IS the id).
+    reverse: Vec<String>,
+    /// unichar → id (the inverse of `reverse`).
+    lookup: HashMap<String, u32>,
+}
+
+impl UniCharSet {
+    /// Parse a `.unicharset` from its text contents. See the module docs for the
+    /// format. Property columns after the leading unichar token are ignored.
+    ///
+    /// # Errors
+    ///
+    /// [`UniCharSetError::Empty`] if there is no count line,
+    /// [`UniCharSetError::BadCount`] if it is not a non-negative integer, and
+    /// [`UniCharSetError::CountMismatch`] if fewer than `count` entry lines
+    /// follow.
+    pub fn load_from_str(text: &str) -> Result<Self, UniCharSetError> {
+        let mut lines = text.lines();
+        let count: usize = lines
+            .next()
+            .ok_or(UniCharSetError::Empty)?
+            .trim()
+            .parse()
+            .map_err(|_| UniCharSetError::BadCount)?;
+
+        let mut reverse = Vec::with_capacity(count.min(text.len()));
+        let mut lookup = HashMap::with_capacity(count.min(text.len()));
+        for line in lines.take(count) {
+            // The unichar is the first whitespace-delimited token; the id is the
+            // entry's position. A unichar repeated in the file keeps its FIRST
+            // id in `lookup` (matches a forward-scan loader), but `reverse` keeps
+            // every entry so `id_to_unichar` is exact per position.
+            let unichar = line.split_whitespace().next().unwrap_or("").to_string();
+            let id = u32::try_from(reverse.len()).map_err(|_| UniCharSetError::BadCount)?;
+            lookup.entry(unichar.clone()).or_insert(id);
+            reverse.push(unichar);
+        }
+
+        if reverse.len() != count {
+            return Err(UniCharSetError::CountMismatch {
+                declared: count,
+                found: reverse.len(),
+            });
+        }
+        Ok(Self { reverse, lookup })
+    }
+
+    /// Parse a `.unicharset` file from disk (a thin wrapper over
+    /// [`Self::load_from_str`]).
+    ///
+    /// # Errors
+    ///
+    /// [`UniCharSetError::Io`] if the file cannot be read, else the parse errors
+    /// of [`Self::load_from_str`].
+    pub fn load_from_file(path: &Path) -> Result<Self, UniCharSetError> {
+        let text = std::fs::read_to_string(path).map_err(|e| UniCharSetError::Io(e.to_string()))?;
+        Self::load_from_str(&text)
+    }
+
+    /// Number of entries (the declared count).
+    #[must_use]
+    pub fn size(&self) -> usize {
+        self.reverse.len()
+    }
+
+    /// The unichar string at `id`, or `None` if out of range. The Rust counterpart
+    /// of the C++ `UNICHARSET::id_to_unichar` oracle in the byte-parity diff.
+    #[must_use]
+    pub fn id_to_unichar(&self, id: u32) -> Option<&str> {
+        self.reverse.get(id as usize).map(String::as_str)
+    }
+
+    /// The id of `unichar`, or `None` if absent (the C++ `INVALID_UNICHAR_ID`
+    /// sentinel maps to `None`; the OGAR adapter boundary re-applies the
+    /// sentinel).
+    #[must_use]
+    pub fn unichar_to_id(&self, unichar: &str) -> Option<u32> {
+        self.lookup.get(unichar).copied()
+    }
+
+    /// Render the id→unichar table as `"<id>\t<unichar>\n"` lines — the exact
+    /// shape the C++ oracle harness prints, so a byte-parity diff is
+    /// `diff oracle_dump.tsv rust_dump.tsv`.
+    #[must_use]
+    pub fn dump(&self) -> String {
+        let mut out = String::new();
+        for (id, unichar) in self.reverse.iter().enumerate() {
+            out.push_str(&id.to_string());
+            out.push('\t');
+            out.push_str(unichar);
+            out.push('\n');
+        }
+        out
+    }
+}
+
+/// A failure loading a `UNICHARSET`.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum UniCharSetError {
+    /// The input had no count line.
+    Empty,
+    /// The count line was not a non-negative integer.
+    BadCount,
+    /// Fewer entry lines than the declared count.
+    CountMismatch {
+        /// The count declared on line 1.
+        declared: usize,
+        /// The number of entry lines actually found.
+        found: usize,
+    },
+    /// The file could not be read (message from the underlying I/O error).
+    Io(String),
+}
+
+impl std::fmt::Display for UniCharSetError {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        match self {
+            Self::Empty => write!(f, "empty unicharset (no count line)"),
+            Self::BadCount => write!(f, "first line is not a valid entry count"),
+            Self::CountMismatch { declared, found } => {
+                write!(f, "declared {declared} entries but found {found}")
+            }
+            Self::Io(msg) => write!(f, "unicharset read failed: {msg}"),
+        }
+    }
+}
+
+impl std::error::Error for UniCharSetError {}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    const SAMPLE: &str = "\
+3
+a 3 0,255,0,255,0,255,0,255,0,255 0 a Left a a
+b 3 0,255,0,255,0,255,0,255,0,255 0 b Left b b
+cd 5 0,255,0,255,0,255,0,255,0,255 0 cd Left cd cd
+";
+
+    #[test]
+    fn parses_count_and_first_token_per_line() {
+        let u = UniCharSet::load_from_str(SAMPLE).expect("valid");
+        assert_eq!(u.size(), 3);
+        assert_eq!(u.id_to_unichar(0), Some("a"));
+        assert_eq!(u.id_to_unichar(2), Some("cd")); // multi-char unichar token
+        assert_eq!(u.id_to_unichar(3), None); // out of range
+    }
+
+    #[test]
+    fn bijection_round_trips() {
+        let u = UniCharSet::load_from_str(SAMPLE).expect("valid");
+        for id in 0..u.size() as u32 {
+            let s = u.id_to_unichar(id).unwrap();
+            assert_eq!(u.unichar_to_id(s), Some(id), "id {id} must round-trip");
+        }
+        assert_eq!(u.unichar_to_id("zzz"), None, "absent unichar -> None");
+    }
+
+    #[test]
+    fn dump_matches_oracle_line_shape() {
+        let u = UniCharSet::load_from_str(SAMPLE).expect("valid");
+        assert_eq!(u.dump(), "0\ta\n1\tb\n2\tcd\n");
+    }
+
+    #[test]
+    fn errors_are_typed() {
+        assert_eq!(UniCharSet::load_from_str(""), Err(UniCharSetError::Empty));
+        assert_eq!(
+            UniCharSet::load_from_str("notanumber\n"),
+            Err(UniCharSetError::BadCount)
+        );
+        assert_eq!(
+            UniCharSet::load_from_str("5\na\nb\n"),
+            Err(UniCharSetError::CountMismatch {
+                declared: 5,
+                found: 2
+            })
+        );
+    }
+}
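
Editor's sketch of the duplicate-unichar rule described in the loader's comment, which none of the four tests exercises. The `build` function is a hypothetical stand-in that reproduces only the two-table construction (no count line, no property columns), so the block runs without the crate.

```rust
use std::collections::HashMap;

/// Build `reverse` and `lookup` from unichars already split out of their
/// entry lines. HYPOTHETICAL helper mirroring the loader's inner loop.
fn build(unichars: &[&str]) -> (Vec<String>, HashMap<String, u32>) {
    let mut reverse = Vec::new();
    let mut lookup = HashMap::new();
    for &unichar in unichars {
        let id = reverse.len() as u32;
        // `or_insert` leaves an existing entry untouched: the first id wins.
        lookup.entry(unichar.to_string()).or_insert(id);
        // `reverse` records every position, repeats included.
        reverse.push(unichar.to_string());
    }
    (reverse, lookup)
}

fn main() {
    let (reverse, lookup) = build(&["a", "b", "a"]);
    println!("{}", reverse[2]); // prints a
    println!("{}", lookup["a"]); // prints 0
}
```

With a repeated unichar, id 2 maps to `a` but `a` maps back to id 0, so the round-trip property checked by `bijection_round_trips` holds only for files without duplicates. Whether the C++ loader resolves repeats the same way is exactly the kind of question the first real diff is meant to answer.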