Commit c8cb95c

contract(unicharset): transcode UNICHARSET other_case (case pair), byte-parity 112/112
Add get_other_case + dump_other_case to UniCharSet, backed by other_cases: Vec<i32>. other_case is the case-paired unichar id ('C' -> 'c'), parsed as the token right after the script and clamped at load exactly as unicharset.cpp:901: a parsed value not less than size -- and the absent-column default (unicharset.cpp:813, = size) -- folds to the id itself. get_other_case mirrors unicharset.h:703 (out-of-range id -> INVALID_UNICHAR_ID -1).

Byte-identical 112/112 vs tesseract's own get_other_case on real eng.lstm-unicharset (self-validating oracle, other_case mode; 60/112 self, 52 real pairs e.g. C->c).

This is the last field cleanly reachable by token-offset; direction/mirror/bbox sit after other_case and need the full multi-tier column parser (next leaf). Fifth leaf of PROBE-OGAR-ADAPTER-UNICHARSET.

- +4 unicharset tests (23 total); clippy -D warnings + fmt clean (scoped -p)
- examples/unicharset_dump.rs gains an `other_case` mode (reproduces the diff)
- board: EPIPHANIES E-CPP-PARITY-5; LATEST_STATE branch-work + D-UNICHARSET-OTHERCASE

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
1 parent 796f69b commit c8cb95c
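
The load-time clamp and the accessor guard described in the commit message can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the committed code: the `OtherCases` type and its `from_parsed` constructor are hypothetical names introduced here, and representing an absent column as `None` is an assumption. Only `get_other_case`, the `other_cases: Vec<i32>` backing field, and the `INVALID_UNICHAR_ID` value of -1 come from the message itself.

```rust
/// Sentinel the commit message cites for an out-of-range id.
pub const INVALID_UNICHAR_ID: i32 = -1;

/// Hypothetical stand-in for the `other_cases` part of `UniCharSet`.
pub struct OtherCases {
    other_cases: Vec<i32>,
}

impl OtherCases {
    /// Build from one raw parsed value per id. `None` models an absent
    /// column (an assumption of this sketch), which defaults to `size`.
    pub fn from_parsed(parsed: &[Option<i32>]) -> Self {
        let size = parsed.len() as i32;
        let other_cases = parsed
            .iter()
            .enumerate()
            .map(|(id, raw)| {
                let value = raw.unwrap_or(size);
                // The clamp as quoted: (other_case < size) ? other_case : id.
                // Any value >= size, including the default, folds to the id.
                if value < size { value } else { id as i32 }
            })
            .collect();
        Self { other_cases }
    }

    /// Return the case-pair id, or INVALID_UNICHAR_ID for an out-of-range id.
    pub fn get_other_case(&self, id: i32) -> i32 {
        if id < 0 {
            return INVALID_UNICHAR_ID;
        }
        self.other_cases
            .get(id as usize)
            .copied()
            .unwrap_or(INVALID_UNICHAR_ID)
    }
}
```

With four entries parsed as `[Some(1), Some(0), Some(7), None]`, ids 0 and 1 pair with each other, while id 2 (value 7, not less than size 4) and id 3 (absent column) both fold to themselves. That is the "no pair and out-of-range pair collapse to self" behaviour the message describes.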

4 files changed

Lines changed: 118 additions & 3 deletions

.claude/board/EPIPHANIES.md

Lines changed: 14 additions & 0 deletions
@@ -73,6 +73,20 @@ The 5+3 council's headline was "five crates linked into one binary with ZERO run
 **Next edges:** Grid→NodeRow over a REAL Spain fixture (E1 acceptance gate); the kanban loop (D2: `LanceVersionScheduler` → `KanbanMove` → SoA write → Lance commit).
 
 Cross-ref: PR #555 INTEGRATION_PLAN D1; battle-test plan probes A1/B-series; STATUS_BOARD symbiont-golden-image-harness; `crates/symbiont/src/bridge.rs`.
+## 2026-06-20 — E-CPP-PARITY-5 — the UNICHARSET `other_case` (case-pair id, size-clamped) is byte-identical to libtesseract; the fifth leaf through PROBE-OGAR-ADAPTER-UNICHARSET
+
+**Status:** FINDING (in-env, real trained data). `lance_graph_contract::unicharset::UniCharSet::get_other_case` dumps the `eng.lstm-unicharset` per-id case-pair ids **byte-identical to tesseract's own `UNICHARSET::get_other_case`, 112/112** (same self-validating oracle, `other_case` mode). The fifth proven adapter surface; the second per-id integer field (after script id) and the first with a load-time **clamp**.
+
+**The transcode detail — a size-clamp with a size-valued default.** `other_case` is the case-paired unichar's id (`'C'` → the id of `'c'`), or the id itself when unpaired. Load clamps it (`unicharset.cpp:901`): `set_other_case(id, (other_case < size) ? other_case : id)` — a parsed value `>= size`, AND the absent-column default (`unicharset.cpp:813`, initialized to `size`), both fold to the id itself. So "no pair" and "out-of-range pair" collapse to self. `get_other_case` (`unicharset.h:703`) returns `INVALID_UNICHAR_ID` (-1) for an out-of-range id — distinct from the `null_sid_`-style 0 of `get_script`. The oracle confirmed 60/112 ids are self, 52 carry a real pair (e.g. id 3 `C`→87 `c`, id 4 `H`→97 `h`).
+
+**Column position without the full tier-parser (again).** `other_case` is the token immediately after the script token in every tier that carries it — so `other_case = token[4] if token[2] has a comma else token[3]` (one token past the script extractor), proven across eng's mixed tiers (tier-5 id 0 with no CSV: `NULL 0 Common 0` → other_case `0`; tier-1 ids with CSV). The remaining columns (direction/mirror/bbox/stats) sit *after* other_case and genuinely need the multi-tier fallback to place — that is the next, larger leaf. This is the last field cleanly reachable by token-offset alone.
+
+**Pattern holds (E-CPP-KEYSTONE-1).** +1 accessor + clamp + one `diff`, no new architecture, no Core gap. +4 contract tests (23 unicharset total); consumed by `tesseract-core::CharSet::get_other_case` (+1 boundary test, 6/6). Reproducible via the committed `examples/unicharset_dump.rs other_case`. Measure-before-assert held: the oracle's `other_case` dump defined the spec (incl. the 60-self / 52-pair split) before the Rust was written.
+
+Cross-ref: `E-CPP-PARITY-1/2/3/4` (the prior four leaves), `E-CPP-KEYSTONE-1`, `.claude/knowledge/core-first-transcode-doctrine.md`. Branch `claude/happy-hamilton-0azlw4`, lance-graph + tesseract-rs.
+
+---
+
 ## 2026-06-20 — E-CPP-PARITY-4 — the UNICHARSET script table (`get_script` + the interned `add_script` table) is byte-identical to libtesseract; the fourth leaf through PROBE-OGAR-ADAPTER-UNICHARSET, and the first to transcode an INTERNING side-table
 
 **Status:** FINDING (in-env, real trained data). `lance_graph_contract::unicharset::UniCharSet::get_script` dumps the `eng.lstm-unicharset` per-id script ids **byte-identical to tesseract's own `UNICHARSET::get_script`, 112/112** — verified by the same self-validating oracle (bijection half = proven 112/112 layout check, then `./uniprops_oracle … script` diffs 0). The fourth proven adapter surface (after id↔unichar, properties, and the UNICHAR codec).
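
The token-offset rule quoted in the E-CPP-PARITY-5 entry (`token[4]` if `token[2]` has a comma, else `token[3]`) can be sketched as below. This is a hedged illustration rather than the crate's parser: the function name `other_case_token` is hypothetical, whitespace tokenization is an assumption, and the value it returns is the raw parsed token before the load-time clamp is applied.

```rust
/// Extract the raw `other_case` token from one unicharset line.
///
/// Hypothetical helper: tokens are split on whitespace, token[2] is either
/// the bbox/stats CSV (contains a comma) or the script name, and
/// `other_case` is the token right after the script.
pub fn other_case_token(line: &str) -> Option<i32> {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    // A comma in token[2] means the CSV column is present, which pushes
    // the script to token[3] and other_case to token[4].
    let has_csv = tokens.get(2)?.contains(',');
    let index = if has_csv { 4 } else { 3 };
    // A missing or non-numeric token yields None; the caller would then
    // fall back to the absent-column default.
    tokens.get(index)?.parse().ok()
}
```

On the no-CSV line cited in the entry, `other_case_token("NULL 0 Common 0")` returns `Some(0)`. On a line with a CSV column, for example the made-up `C 5 0,255,0,255,0,0,0,0,0,0 Latin 87`, it returns `Some(87)`; the CSV values there are invented for illustration and only the `C` to 87 pairing comes from the entry.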

.claude/board/LATEST_STATE.md

Lines changed: 4 additions & 0 deletions
@@ -16,6 +16,8 @@
 
 ---
 
+> **2026-06-20 — branch work (`claude/happy-hamilton-0azlw4`)** — **UNICHARSET `other_case` transcoded + byte-parity proven (E-CPP-PARITY-5), the fifth leaf.** `UniCharSet` now parses the case-pair id (the token right after the script) into `other_cases: Vec<i32>`, applying the load-time clamp (`unicharset.cpp:901`: a value `>= size`, incl. the absent default, folds to the id itself). Exposes `get_other_case` + `dump_other_case`, mirroring `unicharset.h:703` (out-of-range id → `INVALID_UNICHAR_ID` -1). **Byte-identical 112/112** on real `eng.lstm-unicharset` vs tesseract's own `get_other_case` (self-validating oracle, `other_case` mode; 60/112 self, 52 real pairs, e.g. `C`→`c`). Last field cleanly reachable by token-offset; direction/mirror/bbox need the multi-tier parser (next, larger leaf). Additive, zero-dep; +4 contract tests (23 unicharset total), clippy `-D warnings` + fmt clean; reproducible via `examples/unicharset_dump.rs other_case`. Consumed by `tesseract-core::CharSet::get_other_case` (+1 boundary test, 6/6). No Core gap. EPIPHANIES `E-CPP-PARITY-5`.
+>
> **2026-06-20 — branch work (`claude/happy-hamilton-0azlw4`)** — **UNICHARSET script table transcoded + byte-parity proven (E-CPP-PARITY-4), the fourth leaf — first to transcode an INTERNING side-table.** `UniCharSet` now parses the per-line script name (the token after the optional bbox/stats CSV), interns it via an `add_script`-equivalent (`unicharset.cpp:1063`, insertion-order dedup) into `scripts: Vec<String>` with `null_script` ("NULL") seeded at sid 0 (the `unichar_insert` set_script, `unicharset.cpp:680`; so `null_sid_ == 0` always), and stores `script_ids: Vec<i32>`. Exposes `get_script` / `get_script_table_size` / `script_from_script_id` / `script_of` / `dump_script`, mirroring `unicharset.h:681` (out-of-range → `null_sid_` 0). **Byte-identical 112/112** on real `eng.lstm-unicharset` vs tesseract's own `get_script` (same self-validating oracle, `script` mode; oracle table = `["NULL","Common","Latin"]` confirmed empirically before writing the Rust). Mixed-tier safe (eng id 0 is tier-5 no-CSV, others tier-1 CSV). Additive, zero-dep; +4 contract tests (19 unicharset total), clippy `-D warnings` + fmt clean; reproducible via `examples/unicharset_dump.rs script`. Consumed by `tesseract-core::CharSet::{get_script,script_of}` (+1 boundary test, 5/5). No Core gap. EPIPHANIES `E-CPP-PARITY-4`. Next leaf: the full column tier-parser (unlocks other_case/mirror/direction/bbox).
 >
> **2026-06-20 — branch work (`claude/happy-hamilton-0azlw4`)** — **UNICHARSET property accessors transcoded + byte-parity proven (E-CPP-PARITY-3), the third leaf through PROBE-OGAR-ADAPTER-UNICHARSET.** `lance_graph_contract::unicharset::UniCharSet` now parses the per-line hex property bitmask (`unicharset.cpp:824`) into a `props: Vec<u8>` and exposes `get_is{alpha,lower,upper,digit,punctuation}` + `get_isngram` + `dump_properties()`, mirroring the C++ inline accessors (`unicharset.h:497+`; out-of-range id → `false`, `INVALID_UNICHAR_ID` semantics). **Byte-identical 112/112** on real `eng.lstm-unicharset` vs tesseract's own `get_is*` via a **self-validating** oracle: the same harness dumps the id↔unichar bijection (proven 112/112 reference, E-CPP-PARITY-1) AND the properties — the bijection half diffing 0 proves the 5.5.0-header/5.3.4-lib layout is sound, making the property diff (also 0) trustworthy despite the version skew. Additive, zero-dep; +5 contract tests (15 unicharset total), clippy `-D warnings` + fmt clean. Consumed by `tesseract-core` as `CharSet::get_is*` (+1 consumer-boundary test, 4/4 green). Incidental: rustfmt-1.9.0 normalized two pre-existing test-assert wraps in `class_view.rs` (whitespace-only). No Core gap, no adapter state (per `E-CPP-KEYSTONE-1` "repetition of a validated pattern"). EPIPHANIES `E-CPP-PARITY-3`.
@@ -110,6 +112,8 @@
 
> **2026-06-18 — ADDED (D-DO-ARM-1, the OGAR DO arm)**: `lance_graph_contract::action::{ActionState, StateGuard, ActionDef, ClassActions, actions_for, effective_actions, ActionInvocation}` — the Perdurant DO arm completing the OGAR IR (the action-axis sibling of `codegen_manifest`'s `MethodSig`/THINK). Both the 4-agent `sale_order` AR→DO probe (runtime-archaeologist) AND the merged cross-repo PR survey (ruff/OGAR/lance-graph/openproject/tesseract) agreed this was the ONE missing wire: the THINK arm (`classid → ClassView`, `has_function → MethodSig`) is converged + merged; the DO-arm `ActionInvocation`/`ActionDef` type was ABSENT. **`ActionDef`** (static, `const`-constructible, all `&'static`/`Copy`): `predicate` (= harvested `has_function` method), `object_class` (classid), `exec` (`ExecTarget` incl `SurrealQl`), `guard` (`StateGuard` = KausalSpec field==value), `required_role` (RBAC), `overrides` (OGAR `classid→ClassView` inheritance). **`ClassActions`+`actions_for`** (zero-fallback) mirror `ClassMethods`/`methods_for`. **`effective_actions(parent, child)`** = OGAR inheritance on the action axis (child overrides parent by predicate). **`ActionInvocation`** (dynamic, `Copy`): lifecycle `ActionState{Pending→Committed|Failed|Cancelled}` (sticky terminals), S2.5 `cycle` stamp, idempotency/trace keys, HLC `emitted_at_millis`. **`ActionInvocation::commit(def, actor, impact, now)`** is the gated egress — RBAC FIRST (`auth::ActorContext` must hold `required_role` or be admin → else `Failed`), THEN MUL impact (`mul::GateDecision`: `Flow→Committed`+stamped, `Hold→`Pending/escalate, `Block→Cancelled`). This IS "commit to the external consumer (odoo/openproject/woa/tesseract) after the cycle decides sound." Dispatched via `UnifiedStep`/`ExecTarget`, NOT a per-crate endpoint. Additive, zero-dep. +5 tests green. Consumer reference: `docs/OGAR_CONSUMER_API.md`. Branch `claude/soa-write-deinterlace-inc2`.
 
+> **2026-06-20 — ADDED (D-UNICHARSET-OTHERCASE, the case-pair leaf)**: `lance_graph_contract::unicharset::UniCharSet` gained `get_other_case(id) -> i32` + `dump_other_case()`, backed by `other_cases: Vec<i32>`. The case-paired unichar id (`'C'` → `'c'`), parsed as the token after the script and clamped at load (`unicharset.cpp:901`: a value `>= size`, and the absent default = size, fold to the id itself). Out-of-range id → `INVALID_UNICHAR_ID` -1 (`unicharset.h:703`). **Byte-identical 112/112** vs tesseract's own `get_other_case` on real `eng.lstm-unicharset` (self-validating oracle `other_case` mode; 60 self / 52 pairs). Additive, zero-dep. +4 tests (23 unicharset total). Consumed by `tesseract-core::CharSet::get_other_case`. EPIPHANIES `E-CPP-PARITY-5`; fifth leaf of `PROBE-OGAR-ADAPTER-UNICHARSET`; the last field reachable by token-offset (direction/mirror/bbox need the multi-tier parser). Branch `claude/happy-hamilton-0azlw4`.
+
> **2026-06-20 — ADDED (D-UNICHARSET-SCRIPT, the script-table leaf)**: `lance_graph_contract::unicharset::UniCharSet` gained `get_script(id) -> i32` / `get_script_table_size()` / `script_from_script_id(sid) -> Option<&str>` / `script_of(id) -> Option<&str>` / `dump_script()`, backed by new `script_ids: Vec<i32>` + an interned `scripts: Vec<String>`. The first leaf to transcode an **interning side-table** (`add_script`, `unicharset.cpp:1063`): `null_script` "NULL" seeded at sid 0 (the `unichar_insert` set_script, `unicharset.cpp:680` → `null_sid_ == 0`), real scripts intern from 1 in id order. Script name = token after the optional bbox/stats CSV (mixed-tier safe). Out-of-range → `null_sid_` 0 (`unicharset.h:681`). **Byte-identical 112/112** vs tesseract's own `get_script` on real `eng.lstm-unicharset` (self-validating oracle `script` mode; table `["NULL","Common","Latin"]`). Additive, zero-dep, behaviour-preserving on the bijection. +4 tests (19 unicharset total). Consumed by `tesseract-core::CharSet::{get_script,script_of}`. EPIPHANIES `E-CPP-PARITY-4`; fourth leaf of `PROBE-OGAR-ADAPTER-UNICHARSET`. Branch `claude/happy-hamilton-0azlw4`.
 
> **2026-06-20 — ADDED (D-UNICHARSET-PROPS, the property-accessor leaf)**: `lance_graph_contract::unicharset::UniCharSet` gained the character-category surface `get_isalpha` / `get_islower` / `get_isupper` / `get_isdigit` / `get_ispunctuation` / `get_isngram` + `dump_properties()`, backed by a new `props: Vec<u8>` parsed from the per-line hex bitmask (`unicharset.cpp:824`; masked to `ISALPHA=0x1 ISLOWER=0x2 ISUPPER=0x4 ISDIGIT=0x8 ISPUNCTUATION=0x10`). Accessors mirror the C++ inline guard (`unicharset.h:497+`): out-of-range id → `false` (`INVALID_UNICHAR_ID`); `get_isngram` is always-false on the plain-table load path (`unicharset.cpp:893`). **Byte-identical 112/112** vs tesseract's own `get_is*` on real `eng.lstm-unicharset` (self-validating oracle: bijection half cross-checks the 5.5.0-header/5.3.4-lib layout, then the property half diffs 0). Additive, zero-dep, behaviour-preserving on the existing id↔unichar bijection (lenient default-0 for a missing/!hex token). +5 tests (15 unicharset total). Consumed by `tesseract-core::CharSet::get_is*`. EPIPHANIES `E-CPP-PARITY-3`; the third leaf of `PROBE-OGAR-ADAPTER-UNICHARSET` (after D-UNICHARSET-1 + D-UNICHAR-1). Branch `claude/happy-hamilton-0azlw4`.

crates/lance-graph-contract/examples/unicharset_dump.rs

Lines changed: 5 additions & 3 deletions
@@ -1,6 +1,7 @@
 //! Dump a `.unicharset`'s id→unichar table (default), its per-id property bits
-//! (`properties` mode), or its per-id script ids (`script` mode) — the Rust side
-//! of the byte-parity probe `PROBE-OGAR-ADAPTER-UNICHARSET`.
+//! (`properties` mode), its per-id script ids (`script` mode), or its per-id
+//! case-pair ids (`other_case` mode) — the Rust side of the byte-parity probe
+//! `PROBE-OGAR-ADAPTER-UNICHARSET`.
 //!
 //! ```sh
 //! # on a box with libtesseract + libleptonica installed:
@@ -32,7 +33,7 @@ use lance_graph_contract::unicharset::UniCharSet;
 
 fn main() -> ExitCode {
     let Some(path) = std::env::args().nth(1) else {
-        eprintln!("usage: unicharset_dump <path/to/eng.unicharset> [properties|script]");
+        eprintln!("usage: unicharset_dump <path/to/eng.unicharset> [properties|script|other_case]");
         return ExitCode::FAILURE;
     };
     let mode = std::env::args().nth(2).unwrap_or_default();
@@ -41,6 +42,7 @@ fn main() -> ExitCode {
     match mode.as_str() {
         "properties" => print!("{}", unicharset.dump_properties()),
         "script" => print!("{}", unicharset.dump_script()),
+        "other_case" => print!("{}", unicharset.dump_other_case()),
         _ => print!("{}", unicharset.dump()),
     }
     ExitCode::SUCCESS
