Skip to content

Commit 21f325f

Browse files
committed
unicharset: byte-parity GREEN vs libtesseract — PROBE-OGAR-ADAPTER-UNICHARSET FINDING
leptonica installed in-env (apt-get — an install, not a transcode), so the byte-parity probe RAN and passed. UniCharSet dump vs a C++ UNICHARSET FFI oracle on the real eng.lstm-unicharset: 112/112 byte-identical. The falsifier did its job: the documented-format parser matched 111/112; the oracle named the one real convention it missed — the NULL file-token IS the space unichar (unicharset.cpp:882 remaps "NULL" -> " "). One-line fix (load_from_str maps "NULL" -> " "), re-diff, 0 differences. NOT a Core gap. CONJECTURE -> FINDING for the unicharset adapter: the variable-length bijection rides the content-store tier with no Core gap and is byte-exact with libtesseract. Doctrine flipped (core-first-transcode-doctrine.md falsifier RESULT); EPIPHANIES E-CPP-PARITY-1; plan BYTE-PARITY ACHIEVED. The classid->ClassView->UnifiedStep dispatch wiring is mechanical remainder; the lookups themselves are now proven. +1 test (null_token_maps_to_space); contract lib green; clippy + fmt clean. Co-Authored-By: Claude <noreply@anthropic.com> https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
1 parent e82f202 commit 21f325f

4 files changed

Lines changed: 74 additions & 5 deletions

File tree

.claude/board/EPIPHANIES.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,13 @@
1+
## 2026-06-17 — E-CPP-PARITY-1 — the unicharset adapter is byte-identical to libtesseract; PROBE-OGAR-ADAPTER-UNICHARSET green
2+
3+
**Status:** FINDING (in-env, real trained data). `lance_graph_contract::unicharset::UniCharSet` dumps the `eng.lstm-unicharset` id→unichar table **byte-identical to the C++ `UNICHARSET` FFI oracle, 112/112** (installed libtesseract 5.3.4 + libleptonica 1.82; `examples/unicharset_dump.rs` vs an oracle harness built over the source header + the lib's exported `UNICHARSET::{load_from_file,id_to_unichar}` symbols).
4+
5+
**The falsifier did exactly what a falsifier should.** The documented-format Rust parser (first token per line = unichar, id = position) matched 111/112; the C++ oracle named the one real convention it missed — the `NULL` file-token IS the space unichar (`unicharset.cpp:882` remaps `"NULL"` → `" "`). One-line fix, re-diff, 0 differences. NOT a Core gap.
6+
7+
**The leptonica epiphany:** leptonica is an *install*, not a transcode. It is only a *link* dep of the C++ oracle harness — never in the Rust path (the unicharset path is text parsing, never touches `Pix`). Transcoding leptonica (~250k LOC of pointer-heavy C image-processing, the hand-port category) is the far-off zero-C end-state, NOT a prerequisite to prove the pipeline. The whole "we need the operator's leptonica host" framing collapsed to one `apt-get`.
8+
9+
**Scope (honest):** this proves the unicharset adapter's id↔unichar bijection + content-store tier at byte-parity — the doctrine's designated falsifier (`PROBE-OGAR-ADAPTER-UNICHARSET`), now FINDING. The `classid → ClassView → UnifiedStep` dispatch wiring is mechanical remainder; each future method-body leaf is its own parity check, but the core-first adapter pattern is no longer a conjecture. Cross-ref: `core-first-transcode-doctrine.md` § falsifier RESULT; `transcode-extend-core-probe-v1.md` § BYTE-PARITY ACHIEVED.
10+
111
## 2026-06-17 — E-MATERIALIZED-AWARENESS-2 — the driver wire is live (provenance-only); the four vocabularies are one 2-axis structure
212

313
**Status:** FINDING (shipped on branch `claude/materialize-awareness-f34-loop`): the `cognitive-shader-driver` now runs the `materialize` F→34→F loop + the ndarray HHTL `fork_decision` as a **side analysis** per cycle, recording `MaterializeProvenance` on `ShaderCrystal`. **Provenance-only — the gate/emit/persistence path is byte-for-byte unchanged** (operator decision 2026-06-17). 2 driver tests + 638 contract lib green.

.claude/knowledge/core-first-transcode-doctrine.md

Lines changed: 23 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -158,8 +158,9 @@ reviewed — not an excuse to fatten one adapter.
158158

159159
## The falsifier (CONJECTURE → FINDING gate)
160160

161-
Per `truth-architect` discipline, this doctrine is a CONJECTURE until measured.
162-
The cheapest end-to-end probe:
161+
Per `truth-architect` discipline, this doctrine was a CONJECTURE until measured.
162+
**The byte-parity heart RAN GREEN in-env on 2026-06-17 (result below) — promoted
163+
to FINDING for the unicharset adapter.** The cheapest end-to-end probe:
163164

164165
```
165166
PROBE-OGAR-ADAPTER-UNICHARSET (P0)
@@ -174,8 +175,26 @@ PROBE-OGAR-ADAPTER-UNICHARSET (P0)
174175
building the whole transcode.
175176
```
176177

177-
Until this runs green, "the OGAR Core makes the transcode clean" is a
178-
CONJECTURE. Do NOT scale the adapter approach across modules until it passes.
178+
**RESULT — RAN GREEN (2026-06-17, in-env).** Step 1's byte-parity heart is
179+
confirmed: `lance_graph_contract::unicharset::UniCharSet` (the content-store tier
180+
+ `id_to_unichar` / `unichar_to_id` leaves) is **byte-identical to libtesseract**
181+
on the real `eng.lstm-unicharset` — 112/112 entries, diffed against a C++
182+
`UNICHARSET` FFI oracle (installed `libtesseract` 5.3.4 + `libleptonica` 1.82;
183+
`examples/unicharset_dump.rs` vs the oracle harness). The falsifier did its job:
184+
it found exactly one real convention (`NULL` file-token → `" "` space,
185+
`unicharset.cpp:882`) — a one-line fix, NOT a Core gap. The doctrine's central
186+
worry ("the adapter needs state the SoA tenants can't carry") is **refuted**: the
187+
variable-length bijection rides the content-store tier (`deepnsm::Vocabulary`-
188+
shaped) cleanly, no new node state.
189+
190+
So "the OGAR Core makes the transcode clean" is now a **FINDING** for the
191+
unicharset adapter — the bijection / content-store pattern is validated and may be
192+
scaled. **Honest scope:** steps 2–3 (compose via `classid → ClassView` resolver,
193+
invoke through `UnifiedStep`) are mechanical wiring of a now-proven-correct
194+
adapter; each method-body leaf remains its own byte-parity check, but the pattern
195+
is no longer a conjecture. Leptonica is an *install*, not a transcode — it is only
196+
a link dep of the C++ oracle, never in the Rust path (the unicharset path never
197+
touches `Pix`).
179198

180199
## Anti-patterns this doctrine exists to catch
181200

.claude/plans/transcode-extend-core-probe-v1.md

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -458,3 +458,20 @@ side; (4) `diff`. Byte-identical → CONJECTURE→FINDING. Built to the document
458458
loop). The classid→`&UniCharSet` `LazyLock` resolver (the OGAR wiring) and
459459
transcoding leptonica itself (only for the far-off zero-C end-state, a large
460460
hand-port — NOT needed for the probe) both remain explicit follow-ups.
461+
462+
### BYTE-PARITY ACHIEVED — PROBE-OGAR-ADAPTER-UNICHARSET GREEN (2026-06-17, in-env)
463+
464+
leptonica installed in-env (`apt-get` — an install, not a transcode), so the
465+
probe RAN and passed. `UniCharSet` dump vs a C++ `UNICHARSET` FFI oracle on the
466+
real `eng.lstm-unicharset`: **112/112 byte-identical.** The falsifier found
467+
exactly one real convention the documented-format parser missed — the `NULL`
468+
file-token maps to `" "` (space) at runtime (`unicharset.cpp:882`) — a one-line
469+
fix, then 0 diff.
470+
471+
**CONJECTURE → FINDING** for the unicharset adapter: the variable-length
472+
bijection rides the content-store tier with no Core gap and is byte-exact with
473+
libtesseract. The doctrine flips (`core-first-transcode-doctrine.md` § falsifier
474+
RESULT; EPIPHANIES E-CPP-PARITY-1). Remaining is mechanical: the
475+
`classid → &UniCharSet` ClassView resolver + invoke through `UnifiedStep` (the
476+
lookups themselves are now proven). The whole C-FIRST arc —
477+
D → emitter → EXTEND-CORE → byte-parity — is closed in-env.

crates/lance-graph-contract/src/unicharset.rs

Lines changed: 24 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -68,7 +68,14 @@ impl UniCharSet {
6868
// entry's position. A unichar repeated in the file keeps its FIRST
6969
// id in `lookup` (matches a forward-scan loader), but `reverse` keeps
7070
// every entry so `id_to_unichar` is exact per position.
71-
let unichar = line.split_whitespace().next().unwrap_or("").to_string();
71+
//
72+
// The one special token: tesseract stores the space unichar as the
73+
// literal `"NULL"` (a real space can't be a whitespace-delimited
74+
// token), and load remaps `"NULL"` -> `" "` (tesseract
75+
// `unicharset.cpp:882`). The byte-parity probe surfaced this as the
76+
// sole id-0 diff against the C++ oracle.
77+
let token = line.split_whitespace().next().unwrap_or("");
78+
let unichar = if token == "NULL" { " " } else { token }.to_string();
7279
let id = u32::try_from(reverse.len()).map_err(|_| UniCharSetError::BadCount)?;
7380
lookup.entry(unichar.clone()).or_insert(id);
7481
reverse.push(unichar);
@@ -201,6 +208,22 @@ cd 5 0,255,0,255,0,255,0,255,0,255 0 cd Left cd cd
201208
assert_eq!(u.dump(), "0\ta\n1\tb\n2\tcd\n");
202209
}
203210

211+
/// Tesseract stores the space unichar as the literal `"NULL"` token; load
212+
/// remaps it to `" "` (`unicharset.cpp:882`). This is the sole id-0
213+
/// discrepancy the byte-parity probe found against the C++ oracle on the
214+
/// real `eng.lstm-unicharset`.
215+
#[test]
216+
fn null_token_maps_to_space() {
217+
let u = UniCharSet::load_from_str("1\nNULL 0 Common 0\n").expect("valid");
218+
assert_eq!(u.id_to_unichar(0), Some(" "));
219+
assert_eq!(u.unichar_to_id(" "), Some(0));
220+
assert_eq!(
221+
u.unichar_to_id("NULL"),
222+
None,
223+
"NULL is the file token, never the runtime unichar"
224+
);
225+
}
226+
204227
#[test]
205228
fn errors_are_typed() {
206229
assert_eq!(UniCharSet::load_from_str(""), Err(UniCharSetError::Empty));

0 commit comments

Comments
 (0)