
Commit 5b25aad

lance-graph-contract: UniCharSet content store — the byte-parity probe's Rust side
The deferred Option A content-store tier, built (operator's "keep building here" + the leptonica-is-an-install-not-a-transcode epiphany).

New unicharset module: UniCharSet (deepnsm::Vocabulary-shaped: reverse id->unichar + lookup unichar->id), load_from_str/load_from_file parsing the .unicharset text format (line 1 = count; first whitespace token per line = unichar; id = position; property columns ignored; this is the old_style_included_ plain-table scope), id_to_unichar/unichar_to_id (the two adapter leaves), and dump() rendering the <id>\t<unichar> table matching the C++ oracle.

This is the Rust side of PROBE-OGAR-ADAPTER-UNICHARSET. The unicharset path is pure text parsing with ZERO leptonica (it never touches Pix), so it builds and unit-tests in-env with no C deps. leptonica is only an *install* (a link dep of the C++ oracle harness), never a transcode and never in the Rust path.

Byte parity is now one `diff`: combine_tessdata to get eng.unicharset, a ~10-line libtesseract harness dumps id_to_unichar, `cargo run --example unicharset_dump` dumps the Rust side, diff. Byte-identical => CONJECTURE -> FINDING.

Additive (sibling content-store module, zero NodeRow/tenant impact). Board: LATEST_STATE Contract Inventory (D-UNICHARSET-1). +4 tests + the unicharset_dump example; 644 contract lib green; clippy -D warnings + fmt clean. The classid->&UniCharSet LazyLock resolver (OGAR wiring) remains the follow-up.

Co-Authored-By: Claude <noreply@anthropic.com>
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
1 parent d3a268c commit 5b25aad

5 files changed

Lines changed: 284 additions & 0 deletions


.claude/board/LATEST_STATE.md

Lines changed: 2 additions & 0 deletions
@@ -82,6 +82,8 @@

## Current Contract Inventory (lance-graph-contract)

+> **2026-06-17 — ADDED (D-UNICHARSET-1, byte-parity probe Rust side)**: `lance_graph_contract::unicharset::{UniCharSet, UniCharSetError}` — the Tesseract `UNICHARSET` content-store tier (the Core-First doctrine's variable-length classid-keyed registry, `deepnsm::Vocabulary`-shaped: `reverse: Vec<String>` id→unichar + `lookup: HashMap<String,u32>` unichar→id). `load_from_str`/`load_from_file` parse the `.unicharset` text format (line 1 = count, then the first whitespace token per line = unichar, id = position; property columns ignored — the `old_style_included_` plain-table scope); `id_to_unichar`/`unichar_to_id` are the two adapter leaves; `dump()` renders the `<id>\t<unichar>` table matching the C++ oracle. **The Rust side of `PROBE-OGAR-ADAPTER-UNICHARSET`** — pure text parsing, ZERO leptonica (the unicharset path never touches `Pix`), so it builds + unit-tests in-env; byte-parity is one `diff` against a libtesseract oracle harness on a leptonica-installed box (steps in `examples/unicharset_dump.rs`). Additive (a sibling content-store module, zero `NodeRow`/tenant impact). +4 tests (format parse, bijection round-trip, oracle-shape dump, typed errors) + the `unicharset_dump` example; 644 contract lib green; clippy `-D warnings` clean. Plan: `transcode-extend-core-probe-v1.md` (the deferred Option A content-store tier, now built for the probe). The classid→`&UniCharSet` `LazyLock` resolver remains the wiring follow-up.
+
> **2026-06-17 — ADDED (D-CPP-CODEGEN-1, C-FIRST step 2 compile target)**: `lance_graph_contract::codegen_manifest::{MethodSig, ClassMethods, methods_for}` — the Core-side target of the C++ method-resolution manifest emitted by `ruff_cpp_codegen` (the Tesseract AST-DLL pipeline's stage 2). `MethodSig` is the dispatch-relevant signature in a **`const`-constructible** shape (all fields `&'static`: `name`, `params: &'static [&'static str]`, `ret`, `is_const`, `is_static`, `overrides`) — the method-axis sibling of `class_view::ClassView`'s field projection, deliberately NOT `String`-backed (a generated `const X: &[MethodSig] = &[MethodSig { .. }]` must compile; `FieldRef` is `String`-backed and cannot). `ClassMethods{classid, methods}` is the registry ENTRY the generated code emits (classid bound OGAR-side, never minted here); `methods_for(registry, classid) -> &'static [MethodSig]` is the pure lookup with zero-fallback (unregistered classid → empty slice). **Additive** (container-architect ADDITIVE-CONFIRMED): a sibling module, zero `NodeRow`/`ValueTenant`/`ValueSchema`/stride/`ENVELOPE_LAYOUT_VERSION` impact; the runtime `classid→methods` registry DATA lives downstream (generated in the consumer repo), not here. Body-shaping flags (pure-virtual/constexpr/noexcept/operator/requires) are out of scope (they drive body generation, not the signature manifest). The 8-agent step-2 council's deferred-runtime-registry resolution. +2 tests (const-constructibility proof + zero-fallback lookup); 640 contract lib green; clippy `-D warnings` clean. Plan: `.claude/plans/transcode-extend-core-probe-v1.md` (C step 2). Consumer: `ruff_cpp_codegen::render` (AdaWorldAPI/ruff) names this type in emit-text-only output.

> **2026-06-16 — ADDED (4-task unblock-cascade)**: `lance_graph_contract::hhtl::NiblePath::{from_guid_prefix(&NodeGuid) -> Option<NiblePath>, prefix(depth: u8) -> Option<NiblePath>}` — the ontology-side keystone follow-up of #498's `classid → ReadMode` LE contract. The 20-nibble `classid · HEEL · HIP · TWIG` prefix is deterministically folded to 16 (the canon-reserved high `u16` of classid drops); returns `None` when the fold would be lossy (callers don't get silent collisions). `prefix(d)` is the O(1) single-shot ancestor view that satisfies `prefix(d).is_ancestor_of(self)` for every `d ≤ self.depth` — the routing-cache view of a deeper class path. **One layer up** in `cognitive-shader-driver::MailboxSoA<N>`: `impl MailboxSoaView + MailboxSoaOwner` (cherry-pick of `jolly-cori-clnf9::463d71b`) + the `pub phase: KanbanColumn` field — the in-RAM Rubicon owner the contract's `MailboxSoaOwner` had no real implementor for (integrated-cognitive-planner-v1 §2 Seam #3 closed). In `lance_graph::graph::scheduler`: `LanceVersionScheduler<S = NextPhaseScheduler>` — D-MBX-9-IN core impl over `VersionedGraph::versions()`, generic over the inner `VersionScheduler` policy (closes `E-SUBSTRATE-IS-THE-SCHEDULER`'s OUT-direction). In `surreal_container::view`: `SurrealMailboxView<'a>` + `read_via_kv_lance()` (D-PG-6 contract slice) — the SurrealQL read-glove the integrator wires once the cold-build of the surrealdb fork is taken; the contract surface is available today. Plus `SurrealContainerError::BlockedColdBuild` — typed signal for callers to pattern-match the cold-build gate (distinct from the pre-existing `Blocked` variant which signals coordinate/API gaps). Zero-dep contract additions (+7 hhtl tests, 632 lib green); cognitive-shader-driver +1 driving-loop test (86 lib green); lance-graph::scheduler new module (+5 tests, real tempdir Lance); surreal_container::view new module (+4 tests). All four green; clippy `-D warnings` clean on the new files. 
EPIPHANIES `E-UNBLOCK-CASCADE-1` records the convergence of three independent landings onto the single `MailboxSoaView` trait surface.

.claude/plans/transcode-extend-core-probe-v1.md

Lines changed: 23 additions & 0 deletions
@@ -435,3 +435,26 @@ the generated crate into tesseract-rs (needs the leptonica build env) and run
`PROBE-OGAR-ADAPTER-UNICHARSET` byte-parity (Option B), the only path to
CONJECTURE→FINDING. Everything to here is CONJECTURE per the doctrine; the
`PARITY: UNRUN` markers on every generated file say so.
+
+### Unicharset content store + byte-parity probe: Rust side READY (2026-06-17)
+
+The deferred Option A content-store tier, built (the operator chose "keep
+building here" + the leptonica-is-an-install-not-a-transcode epiphany):
+`lance_graph_contract::unicharset::UniCharSet`, `deepnsm::Vocabulary`-shaped
+(`reverse`/`lookup`), `load_from_str`/`load_from_file` + `id_to_unichar` /
+`unichar_to_id` + `dump()`. **Pure text parsing, ZERO leptonica** (the unicharset
+path never touches `Pix`), so it builds + unit-tests in-env (4 tests; 644 lib
+green). The `unicharset_dump` example renders the oracle-shape table.
+
+**The byte-parity probe is now one diff, not a build.** leptonica is an *install*
+(`apt-get install libtesseract-dev libleptonica-dev`), never a transcode; it's only a
+*link* dep of the C++ oracle harness, never in the Rust path. So
+`PROBE-OGAR-ADAPTER-UNICHARSET` reduces to: (1) `combine_tessdata -u … eng.` to
+get a real `.unicharset`; (2) a ~10-line C++ harness (`-ltesseract -lleptonica`)
+dumps `id_to_unichar`; (3) `cargo run --example unicharset_dump` dumps the Rust
+side; (4) `diff`. Byte-identical → CONJECTURE→FINDING. Built to the documented
+`old_style_included_` plain-table format; any special-token edge case the real
+`eng.unicharset` shows on first diff is the refine-then item (the falsifier
+loop). The classid→`&UniCharSet` `LazyLock` resolver (the OGAR wiring) and
+transcoding leptonica itself (only for the far-off zero-C end-state, a large
+hand-port, NOT needed for the probe) both remain explicit follow-ups.
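
Step (4) of the probe is a plain `diff`. As a minimal sketch of the same check done in Rust, the helper below compares two dumps row by row and reports the first row where they differ, which may help when chasing the special-token edge cases the plan anticipates. `first_divergence` is a hypothetical name that is not part of this commit, and the sketch assumes both dumps use the `<id>\t<unichar>\n` shape.

```rust
/// Hypothetical helper (not in the commit): first row where two dumps differ.
/// Returns the 0-based row index plus the oracle and Rust rows at that index;
/// a row missing on one side (that dump is shorter) is reported as `None`.
pub fn first_divergence<'a>(
    oracle: &'a str,
    rust: &'a str,
) -> Option<(usize, Option<&'a str>, Option<&'a str>)> {
    // Split on '\n' only, so a stray '\r' still counts as a byte difference.
    let mut left = oracle.split('\n');
    let mut right = rust.split('\n');
    let mut row = 0;
    loop {
        match (left.next(), right.next()) {
            // Both inputs exhausted together: identical to the last byte.
            (None, None) => return None,
            (l, r) if l == r => row += 1,
            (l, r) => return Some((row, l, r)),
        }
    }
}
```

Because splitting on `'\n'` is lossless, the function returns `None` exactly when the two inputs are byte-identical, which is the CONJECTURE→FINDING condition stated above.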
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+//! Dump a `.unicharset`'s id→unichar table: the Rust side of the byte-parity
+//! probe `PROBE-OGAR-ADAPTER-UNICHARSET`.
+//!
+//! ```sh
+//! # on a box with libtesseract + libleptonica installed:
+//! combine_tessdata -u $(dpkg -L tesseract-ocr-eng | grep eng.traineddata) /tmp/eng.
+//! # C++ oracle (links -lleptonica only to satisfy the linker; never calls it):
+//! #   g++ oracle.cpp -ltesseract -lleptonica -o oracle && ./oracle /tmp/eng.unicharset > /tmp/oracle.tsv
+//! # Rust side:
+//! cargo run -p lance-graph-contract --example unicharset_dump -- /tmp/eng.unicharset > /tmp/rust.tsv
+//! diff /tmp/oracle.tsv /tmp/rust.tsv   # byte-identical => CONJECTURE -> FINDING
+//! ```
+
+#![allow(
+    clippy::print_stdout,
+    reason = "a dump CLI example writes to stdout by design"
+)]
+
+use std::path::Path;
+use std::process::ExitCode;
+
+use lance_graph_contract::unicharset::UniCharSet;
+
+fn main() -> ExitCode {
+    let Some(path) = std::env::args().nth(1) else {
+        eprintln!("usage: unicharset_dump <path/to/eng.unicharset>");
+        return ExitCode::FAILURE;
+    };
+    match UniCharSet::load_from_file(Path::new(&path)) {
+        Ok(unicharset) => {
+            print!("{}", unicharset.dump());
+            ExitCode::SUCCESS
+        }
+        Err(err) => {
+            eprintln!("error: {err}");
+            ExitCode::FAILURE
+        }
+    }
+}
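
The example prints with `print!` rather than `println!`: `dump()` already ends every row with `\n`, so an extra trailing newline would break byte parity. A minimal sketch of the same idea against any `std::io::Write` sink follows. `write_dump` is a hypothetical helper, not part of this commit, and it assumes the dump is handed over as a string.

```rust
use std::io::{self, Write};

/// Hypothetical helper (not in the commit): write a dump to `out` byte for byte.
/// Nothing is appended, so the sink receives exactly the bytes of `dump`.
pub fn write_dump<W: Write>(out: &mut W, dump: &str) -> io::Result<()> {
    out.write_all(dump.as_bytes())?;
    out.flush()
}
```

A `Vec<u8>` works as the sink in tests, and a locked `io::stdout()` would work in a CLI.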

crates/lance-graph-contract/src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ pub mod soa_view;
pub mod splat;
pub mod tax;
pub mod thinking;
+pub mod unicharset;
pub mod view_angle;
pub mod vsa;
pub mod witness_table;
Lines changed: 219 additions & 0 deletions
@@ -0,0 +1,219 @@
+//! `UNICHARSET` content store: the Rust side of the byte-parity probe
+//! (`PROBE-OGAR-ADAPTER-UNICHARSET`).
+//!
+//! Tesseract's `UNICHARSET` is a variable-length id↔unichar bijection loaded
+//! from a `.unicharset` text file. Per the Core-First doctrine it is NOT
+//! fixed-width per-node state; it rides a **classid-keyed content-store tier**
+//! shaped exactly like `deepnsm::Vocabulary`: a `reverse: Vec<String>`
+//! (id → unichar) plus a `lookup: HashMap<String, u32>` (unichar → id). This
+//! module is that tier plus the two adapter leaves (`id_to_unichar` /
+//! `unichar_to_id`).
+//!
+//! # Why this is the byte-parity surface
+//!
+//! The unicharset path is pure text parsing; it never touches leptonica or
+//! `Pix`. So the Rust side can be built and tested with **zero C dependencies**.
+//! The probe compares this implementation's [`UniCharSet::dump`] of a real
+//! `eng.unicharset` against the C++ `UNICHARSET::id_to_unichar` oracle (a small
+//! libtesseract harness, which only *links* leptonica, never calls it).
+//! Byte-identical dumps promote the doctrine CONJECTURE → FINDING.
+//!
+//! # Format scope
+//!
+//! The `.unicharset` format is: line 1 = entry count `N`; then `N` lines, each
+//! beginning with the unichar as its first whitespace-delimited token. The
+//! remaining columns (properties / script / bounding boxes) do not affect the
+//! id↔unichar bijection and are ignored. The line position (0-based, after the
+//! count line) IS the unichar id. This is the `old_style_included_ == true`
+//! plain-table scope the adapter-shaper bounded; fragment/`CleanupString`
+//! normalization is a separate, later leaf. Any special-token edge case a real
+//! `eng.unicharset` reveals on first diff is refined then; this is built to the
+//! documented format, diff-pending.
+
+use std::collections::HashMap;
+use std::path::Path;
+
+/// A loaded `UNICHARSET`: the id↔unichar bijection, `deepnsm::Vocabulary`-shaped.
+#[derive(Debug, Clone, Default, PartialEq, Eq)]
+pub struct UniCharSet {
+    /// id → unichar (index IS the id).
+    reverse: Vec<String>,
+    /// unichar → id (the inverse of `reverse`).
+    lookup: HashMap<String, u32>,
+}
+
+impl UniCharSet {
+    /// Parse a `.unicharset` from its text contents. See the module docs for the
+    /// format. Property columns after the leading unichar token are ignored.
+    ///
+    /// # Errors
+    ///
+    /// [`UniCharSetError::Empty`] if there is no count line,
+    /// [`UniCharSetError::BadCount`] if that line is not a non-negative integer,
+    /// and [`UniCharSetError::CountMismatch`] if fewer than `count` entry lines
+    /// follow.
+    pub fn load_from_str(text: &str) -> Result<Self, UniCharSetError> {
+        let mut lines = text.lines();
+        let count: usize = lines
+            .next()
+            .ok_or(UniCharSetError::Empty)?
+            .trim()
+            .parse()
+            .map_err(|_| UniCharSetError::BadCount)?;
+
+        let mut reverse = Vec::with_capacity(count);
+        let mut lookup = HashMap::with_capacity(count);
+        for line in lines.take(count) {
+            // The unichar is the first whitespace-delimited token; the id is the
+            // entry's position. A unichar repeated in the file keeps its FIRST
+            // id in `lookup` (matches a forward-scan loader), but `reverse` keeps
+            // every entry so `id_to_unichar` is exact per position.
+            let unichar = line.split_whitespace().next().unwrap_or("").to_string();
+            let id = u32::try_from(reverse.len()).map_err(|_| UniCharSetError::BadCount)?;
+            lookup.entry(unichar.clone()).or_insert(id);
+            reverse.push(unichar);
+        }
+
+        if reverse.len() != count {
+            return Err(UniCharSetError::CountMismatch {
+                declared: count,
+                found: reverse.len(),
+            });
+        }
+        Ok(Self { reverse, lookup })
+    }
+
+    /// Parse a `.unicharset` file from disk (a thin wrapper over
+    /// [`Self::load_from_str`]).
+    ///
+    /// # Errors
+    ///
+    /// [`UniCharSetError::Io`] if the file cannot be read, else the parse errors
+    /// of [`Self::load_from_str`].
+    pub fn load_from_file(path: &Path) -> Result<Self, UniCharSetError> {
+        let text = std::fs::read_to_string(path).map_err(|e| UniCharSetError::Io(e.to_string()))?;
+        Self::load_from_str(&text)
+    }
+
+    /// Number of entries (the declared count).
+    #[must_use]
+    pub fn size(&self) -> usize {
+        self.reverse.len()
+    }
+
+    /// The unichar string at `id`, or `None` if out of range. This is the leaf
+    /// the C++ oracle is compared against in the byte-parity diff.
+    #[must_use]
+    pub fn id_to_unichar(&self, id: u32) -> Option<&str> {
+        self.reverse.get(id as usize).map(String::as_str)
+    }
+
+    /// The id of `unichar`, or `None` if absent (the C++ `INVALID_UNICHAR_ID`
+    /// sentinel maps to `None`; the OGAR adapter boundary re-applies the
+    /// sentinel).
+    #[must_use]
+    pub fn unichar_to_id(&self, unichar: &str) -> Option<u32> {
+        self.lookup.get(unichar).copied()
+    }
+
+    /// Render the id→unichar table as `"<id>\t<unichar>\n"` lines, the exact
+    /// shape the C++ oracle harness prints, so a byte-parity diff is
+    /// `diff oracle_dump.tsv rust_dump.tsv`.
+    #[must_use]
+    pub fn dump(&self) -> String {
+        let mut out = String::new();
+        for (id, unichar) in self.reverse.iter().enumerate() {
+            out.push_str(&id.to_string());
+            out.push('\t');
+            out.push_str(unichar);
+            out.push('\n');
+        }
+        out
+    }
+}
+
+/// A failure loading a `UNICHARSET`.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum UniCharSetError {
+    /// The input had no count line.
+    Empty,
+    /// The count line was not a non-negative integer.
+    BadCount,
+    /// Fewer entry lines than the declared count.
+    CountMismatch {
+        /// The count declared on line 1.
+        declared: usize,
+        /// The number of entry lines actually found.
+        found: usize,
+    },
+    /// The file could not be read (message from the underlying I/O error).
+    Io(String),
+}
+
+impl std::fmt::Display for UniCharSetError {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        match self {
+            Self::Empty => write!(f, "empty unicharset (no count line)"),
+            Self::BadCount => write!(f, "first line is not a valid entry count"),
+            Self::CountMismatch { declared, found } => {
+                write!(f, "declared {declared} entries but found {found}")
+            }
+            Self::Io(msg) => write!(f, "unicharset read failed: {msg}"),
+        }
+    }
+}
+
+impl std::error::Error for UniCharSetError {}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    const SAMPLE: &str = "\
+3
+a 3 0,255,0,255,0,255,0,255,0,255 0 a Left a a
+b 3 0,255,0,255,0,255,0,255,0,255 0 b Left b b
+cd 5 0,255,0,255,0,255,0,255,0,255 0 cd Left cd cd
+";
+
+    #[test]
+    fn parses_count_and_first_token_per_line() {
+        let u = UniCharSet::load_from_str(SAMPLE).expect("valid");
+        assert_eq!(u.size(), 3);
+        assert_eq!(u.id_to_unichar(0), Some("a"));
+        assert_eq!(u.id_to_unichar(2), Some("cd")); // multi-char unichar token
+        assert_eq!(u.id_to_unichar(3), None); // out of range
+    }
+
+    #[test]
+    fn bijection_round_trips() {
+        let u = UniCharSet::load_from_str(SAMPLE).expect("valid");
+        for id in 0..u.size() as u32 {
+            let s = u.id_to_unichar(id).unwrap();
+            assert_eq!(u.unichar_to_id(s), Some(id), "id {id} must round-trip");
+        }
+        assert_eq!(u.unichar_to_id("zzz"), None, "absent unichar -> None");
+    }
+
+    #[test]
+    fn dump_matches_oracle_line_shape() {
+        let u = UniCharSet::load_from_str(SAMPLE).expect("valid");
+        assert_eq!(u.dump(), "0\ta\n1\tb\n2\tcd\n");
+    }
+
+    #[test]
+    fn errors_are_typed() {
+        assert_eq!(UniCharSet::load_from_str(""), Err(UniCharSetError::Empty));
+        assert_eq!(
+            UniCharSet::load_from_str("notanumber\n"),
+            Err(UniCharSetError::BadCount)
+        );
+        assert_eq!(
+            UniCharSet::load_from_str("5\na\nb\n"),
+            Err(UniCharSetError::CountMismatch {
+                declared: 5,
+                found: 2
+            })
+        );
+    }
+}
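
The loader's duplicate rule can be looked at in isolation. The sketch below is a standalone restatement, not the module itself: `index_first_wins` is a hypothetical name, and it takes already-extracted unichar tokens instead of file text. It shows that `entry(..).or_insert(..)` keeps the first id for a repeated unichar while the positional table keeps every row.

```rust
use std::collections::HashMap;

/// Hypothetical standalone restatement of the loader's indexing step.
/// `reverse` keeps every token by position; `lookup` keeps the FIRST id only.
pub fn index_first_wins(tokens: &[&str]) -> (Vec<String>, HashMap<String, u32>) {
    let mut reverse = Vec::with_capacity(tokens.len());
    let mut lookup = HashMap::with_capacity(tokens.len());
    for (id, token) in tokens.iter().enumerate() {
        // `or_insert` is a no-op when the key already exists, so a repeated
        // unichar never overwrites the id recorded at its first position.
        // The cast assumes fewer than 2^32 tokens (the module uses `try_from`).
        lookup.entry((*token).to_string()).or_insert(id as u32);
        reverse.push((*token).to_string());
    }
    (reverse, lookup)
}
```

One consequence: for the tokens `["a", "b", "a"]`, id 2 maps to `"a"` but `"a"` maps back to id 0, so the id↔unichar round trip only holds for files without repeated unichars.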
