Commit cfcd4af

Merge pull request #497 from AdaWorldAPI/plan/tesseract-rs-transcode
docs(plan): Tesseract → tesseract-rs 1:1 transcode (LSTM hosted via embedanything) — v2
2 parents 2e58e03 + 298e8e9 commit cfcd4af

7 files changed

Lines changed: 483 additions & 0 deletions
Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
1+
# OCR → Canonical SoA Integration v1
2+
3+
> **Type:** plan (sub-plan — the one that binds OCR to the lance-graph substrate). Deliverables D-OCR-50/51/52/53.
4+
> **Status:** PLANTED 2026-06-15 — design only. THIS is "use the new architecture we raced for."
5+
> **Front:** post-#496. Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`).
6+
> **Canon anchors:** OGAR/CLAUDE.md P0 GUID; lance-graph/CLAUDE.md SoA node (`4ea6ac9`); soa-three-tier-model; DeepNSM crate (`lance-graph/crates/deepnsm`); helix/CAM-PQ (`crates/helix`, `bgz-tensor` CAM-PQ).
7+
> **Skip-by-rule:** OCR introduces NO bespoke row geometry. It rides the existing value-tenant carve.
8+
9+
---
10+
11+
## 0. Intent
12+
13+
An OCR token is not a foreign payload that needs a boundary adapter — it **is** a
14+
canonical SoA node. This plan defines the mapping so recognized text lands directly
15+
in the substrate: addressed by HHTL, classed by OGAR, valued by a `ValueSchema`
16+
preset over *existing* `ValueTenant`s, edged by `EdgeCodecFlavor`, repaired by
17+
DeepNSM + CAM/PQ, and persisted via `NodeRowPacket`. Zero boundary tax — the whole
18+
point of the splat-native / "one representation, many views" doctrine, applied to OCR.
19+
20+
## 1. OCR token → `NodeRow` mapping (D-OCR-50)
21+
22+
**Key (`NodeGuid`, 16 B):** `classid · HEEL · HIP · TWIG · family · identity`.
23+
- `classid` = the minted OCR class prefix (see §2). `0x0000_0000` fallback until OGAR mints it.
24+
- HHTL path (HEEL/HIP/TWIG) = document → page → block (the layout hierarchy from `ocrs::layout_analysis`).
25+
- `family` (3 B) = line/region basin; `identity` (3 B) = token ordinal within the basin.
26+
`local_key()` (trailing 6 B) addresses a token within its line after the trie walk.
27+
28+
**Edges (`EdgeBlock`, 16 B = 12 in-family + 4 out-of-family):**
29+
- in-family (12): reading-order + local-layout adjacency (prev/next token, same-line
30+
neighbors, baseline siblings). `EdgeCodecFlavor::CoarseOnly` (1 B/slot) — pure topology.
31+
- out-of-family (4): inherited adapters — (A) table-cell membership, (B) block/column
32+
parent, (C) semantic/coref link (post-DeepNSM), (D) source-region (bbox → page geometry).
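The key and edge layout above can be sketched as plain byte packing. This is a minimal illustration, not the `canonical_node.rs` API: `NodeGuidSketch` and `EdgeBlockSketch` are hypothetical names, the 4-byte `classid` width is inferred from the `0x0000_0000` fallback, and the 2-byte width for each of HEEL/HIP/TWIG and the little-endian field encoding are assumptions chosen so the fields sum to 16 bytes.

```rust
/// Hypothetical 16-byte key: classid(4) | HEEL(2) | HIP(2) | TWIG(2) | family(3) | identity(3).
/// The 2-byte HHTL widths and the LE encoding are assumptions; the plan only fixes
/// classid at offset 0, family/identity at 3 B each, and the trailing 6 B as the local key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeGuidSketch([u8; 16]);

impl NodeGuidSketch {
    /// Returns `None` when family or identity does not fit in 24 bits.
    pub fn pack(classid: u32, heel: u16, hip: u16, twig: u16, family: u32, identity: u32) -> Option<Self> {
        if family > 0x00FF_FFFF || identity > 0x00FF_FFFF {
            return None;
        }
        let mut b = [0u8; 16];
        b[0..4].copy_from_slice(&classid.to_le_bytes());
        b[4..6].copy_from_slice(&heel.to_le_bytes());
        b[6..8].copy_from_slice(&hip.to_le_bytes());
        b[8..10].copy_from_slice(&twig.to_le_bytes());
        b[10..13].copy_from_slice(&family.to_le_bytes()[..3]);
        b[13..16].copy_from_slice(&identity.to_le_bytes()[..3]);
        Some(Self(b))
    }

    /// Trailing 6 bytes (family + identity): addresses a token within its line.
    pub fn local_key(&self) -> [u8; 6] {
        let mut k = [0u8; 6];
        k.copy_from_slice(&self.0[10..16]);
        k
    }

    pub fn as_bytes(&self) -> &[u8; 16] {
        &self.0
    }
}

/// 16 one-byte slots: 12 in-family + 4 out-of-family (CoarseOnly = 1 B/slot).
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct EdgeBlockSketch {
    pub in_family: [u8; 12],
    pub out_of_family: [u8; 4],
}
```

With these assumptions a token keyed by document 1, page 2, block 3 keeps its line and ordinal in the last six bytes, which is what the post-trie-walk lookup needs.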
33+
34+
## 2. OCR class + HHTL address scheme (D-OCR-50)
35+
36+
- Mint an OCR class family in OGAR (`ogar-ontology`): `Document → Page → Block →
37+
Line → Token`, with leaf token subtypes (`Word`, `Number`, `Date`, `Currency`,
38+
`Glyph`, `TableCell`). Until OGAR mints them, hardcode the classid prefix space
39+
per the reserve-don't-reclaim ladder (the classid bytes stay reserved at offset 0).
40+
- `ClassView` for the OCR class declares `edge_codec_flavor` (`CoarseOnly`) and
41+
`value_schema` (the OCR preset, §3).
42+
43+
## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51)
44+
45+
The 480-byte value slab already carves into `VALUE_TENANTS`. An OCR token is **not
46+
a stored string and not a hash** — it is the *terminal of the perturbation cascade*,
47+
reconstructed exactly like every other node. Text = codebook index + residue.
48+
49+
| Tenant (existing) | OCR role |
|---|---|
| helix residue = **centroid attention field** (NOT a stored code) | The 24-bit golden index is the **query↔centroid alignment** (φ-spiral direction = how this point attends to its place-centroid); the Morton-tile stacked-pyramid perturbation-shader is **multi-scale attention** (coarse centroid → fine perturbation = HHTL cascade in residue space). The field is **evaluated from the φ-template, never stored** ("8K resolution at Super-8 cost": only the index is kept). Place = HHTL centroid; residue = perturbation off it. The 48-byte `ValueTenant::HelixResidue` is category-wrong (it stores a field that must be computed): do NOT use it. |
| `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook |
| `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV |
| `EntityType` (u16) | token subtype (Word/Number/Date/Currency/Glyph/TableCell) |
| `Plasticity` (u32) | correction history / last-repair stamp |
56+
57+
**Reconstruction (this is the round-trip, and it answers Codex P1):**
58+
`text ⇄ codebook_index(Meta) + field-eval(helix 24-bit golden-index attention ⊕ TurbovecResidue PQ)`. Decode =
59+
the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to
60+
the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` /
61+
coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The
62+
reversibility lives in residue + codebook, which is the architecture's whole point.
63+
64+
**True-OOV (no codebook neighbor — a raw code like `69B8`):** falls back to the
65+
**recoder-code residue**: `recodebeam` already emits recoder codes, not pixels, so
66+
the codes themselves are the reversible payload in `Meta`, repaired by the
67+
char-confusion grammar (D-OCR-52). Still a residue, never a hash.
68+
69+
**ValueSchema:** `Cognitive` does NOT include `HelixResidue`/`TurbovecResidue`, so
70+
OCR needs a dedicated **`ValueSchema::Ocr`** = `FieldMask` over
71+
{`HelixResidue`,`TurbovecResidue`,`Meta`,`EntityType`,`Plasticity`}. Selection only;
72+
moves no tenant (canon: tenants never move/reuse).
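A selection-only preset can be pictured as a bitmask over tenant identifiers. The sketch below is an assumption-laden illustration: `TenantSketch`, `FieldMaskSketch`, `ocr_preset`, and the bit positions are hypothetical and do not reflect the real `class_view.rs` `FieldMask`.

```rust
/// Hypothetical tenant identifiers; the discriminants are illustrative bit positions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum TenantSketch {
    HelixResidue = 0,
    TurbovecResidue = 1,
    Meta = 2,
    EntityType = 3,
    Plasticity = 4,
}

/// A mask only selects tenants; it never moves or resizes them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FieldMaskSketch(u16);

impl FieldMaskSketch {
    pub fn of(tenants: &[TenantSketch]) -> Self {
        Self(tenants.iter().fold(0u16, |bits, &t| bits | (1u16 << (t as u8))))
    }

    pub fn contains(self, t: TenantSketch) -> bool {
        self.0 & (1u16 << (t as u8)) != 0
    }

    pub fn count(self) -> u32 {
        self.0.count_ones()
    }
}

/// The five tenants named for `ValueSchema::Ocr` in the paragraph above.
pub fn ocr_preset() -> FieldMaskSketch {
    FieldMaskSketch::of(&[
        TenantSketch::HelixResidue,
        TenantSketch::TurbovecResidue,
        TenantSketch::Meta,
        TenantSketch::EntityType,
        TenantSketch::Plasticity,
    ])
}
```

Because the preset is only a set of bits, adding it changes no offsets in the value slab, which is the "selection only" property the plan asks for.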
73+
74+
## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52)
75+
76+
The recognizer emits candidates+confidence; repair is the brainstem we already have:
77+
- **Character/orthographic layer (new, thin, below DeepNSM):** `0/O 1/I/l 5/S rn/m`
78+
confusion table + number/date/currency/table-cell grammars. Repairs orthography on
79+
OOV garbage (codes, IDs like `69B8`) BEFORE the word layer. (This is the only
80+
genuinely greenfield code; the word-frequency half already exists as
81+
`deepnsm/word_frequency`.)
82+
- **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder` →
83+
`similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation.
84+
- **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue`
85+
(PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA (clustered-hierarchical outlier detection) flags anomalous
86+
tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`.
87+
88+
Repaired token writes back: corrected text → codebook index in `Meta` (§3: no `Fingerprint` hash) + `EntityType`, repair
89+
provenance → `Meta`/`Plasticity`.
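The character/orthographic layer can be illustrated with a single-substitution repair over the confusion pairs listed above. This is a minimal sketch under stated assumptions: `repair` and `CONFUSIONS` are hypothetical names, the vocabulary is a plain `HashSet` standing in for the DeepNSM codebook, and only one substitution per token is tried (no grammars, no CAKES search).

```rust
use std::collections::HashSet;

/// Confusion pairs from the plan (0/O, 1/I/l, 5/S, rn/m), in both directions.
const CONFUSIONS: &[(&str, &str)] = &[
    ("0", "O"), ("O", "0"),
    ("1", "I"), ("I", "1"), ("1", "l"), ("l", "1"), ("I", "l"), ("l", "I"),
    ("5", "S"), ("S", "5"),
    ("rn", "m"), ("m", "rn"),
];

/// Return the token itself if it is valid, otherwise the first single-substitution
/// candidate found in the vocabulary, otherwise `None` (treated as true-OOV).
pub fn repair(token: &str, vocab: &HashSet<&str>) -> Option<String> {
    if vocab.contains(token) {
        return Some(token.to_string());
    }
    for &(from, to) in CONFUSIONS {
        // Try each occurrence of `from` on its own, leaving the rest untouched.
        for (idx, _) in token.match_indices(from) {
            let mut cand = String::with_capacity(token.len() + to.len());
            cand.push_str(&token[..idx]);
            cand.push_str(to);
            cand.push_str(&token[idx + from.len()..]);
            if vocab.contains(cand.as_str()) {
                return Some(cand);
            }
        }
    }
    None
}
```

A token such as `69B8` contains none of the confusable characters and yields no vocabulary hit, so it falls through as `None`; in the plan that case is handled by the code/ID grammars and the recoder-code fallback rather than by word plausibility.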
90+
91+
## 5. Persistence + planner (kv-lance / surreal)
92+
93+
- `NodeRowPacket` → `SoaEnvelope` → Lance (kv-lance backend, per `surrealdb` fork).
94+
OCR nodes are ordinary rows; a Lance version is a coherent page/document snapshot.
95+
- `surreal_container` as the **OCR-job control plane** (per its role: planner / AST
96+
adapter / time-series / kanban): kanban of OCR jobs (queued→detect→recognize→
97+
repair→persisted via the Rubicon transitions already in `soa_view.rs`), time-series
98+
of throughput, AST API for the repair-grammar (compile-time vs JIT grammars).
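The job kanban named above is a linear state machine. The sketch below only mirrors the stage names from the bullet; `OcrJobState` and `run_to_completion` are hypothetical and are not the Rubicon transitions in `soa_view.rs`, whose actual shape is not shown here.

```rust
/// Stages as listed in the plan: queued → detect → recognize → repair → persisted.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OcrJobState {
    Queued,
    Detect,
    Recognize,
    Repair,
    Persisted,
}

impl OcrJobState {
    /// The only legal forward transition from each stage; `None` at the terminal stage.
    pub fn next(self) -> Option<Self> {
        match self {
            OcrJobState::Queued => Some(OcrJobState::Detect),
            OcrJobState::Detect => Some(OcrJobState::Recognize),
            OcrJobState::Recognize => Some(OcrJobState::Repair),
            OcrJobState::Repair => Some(OcrJobState::Persisted),
            OcrJobState::Persisted => None,
        }
    }
}

/// Walk a job forward and record every stage it passes through.
pub fn run_to_completion(start: OcrJobState) -> Vec<OcrJobState> {
    let mut path = vec![start];
    let mut cur = start;
    while let Some(n) = cur.next() {
        path.push(n);
        cur = n;
    }
    path
}
```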
99+
100+
## 6. Bit-reproducibility harness (D-OCR-53) — the migration payoff
101+
102+
The transcode oracle (D-OCR-2x) makes OCR a **deterministic regression source for
103+
the whole SoA migration**: the same line crop → C++ Tesseract text AND Rust port
104+
text AND the resulting `NodeRow` bytes. Because every stage is supposed to be
105+
bit-reproducible (DeepNSM bit-reproducible, envelope version-stamped, CausalEdge64
106+
locked), a golden-file diff over (crop → NodeRow) exercises exactly the muscles the
107+
migration must harden: `ndarray::hpc` hydration, the envelope LE round-trip, and
108+
SIMD numeric exactness. OCR is the best external oracle the substrate has.
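The golden-file diff itself is a byte comparison. A minimal sketch, assuming the `NodeRow` bytes are already available as a slice (the helper name `first_diff` is hypothetical); reporting the first differing offset makes a failure easier to map back to a tenant or key field.

```rust
/// Compare produced bytes with a golden image.
/// Returns the first differing offset, or `None` when the images are identical.
/// A length mismatch is reported at the end of the shorter image.
pub fn first_diff(golden: &[u8], actual: &[u8]) -> Option<usize> {
    let common = golden.len().min(actual.len());
    golden[..common]
        .iter()
        .zip(&actual[..common])
        .position(|(g, a)| g != a)
        .or_else(|| {
            if golden.len() != actual.len() {
                Some(common)
            } else {
                None
            }
        })
}
```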
109+
110+
## 7. Deliverables
111+
112+
- **D-OCR-50:** OCR class + HHTL address scheme; `ClassView` impl for OCR class.
113+
- **D-OCR-51:** `ValueSchema` OCR preset (FieldMask over existing tenants); a token
114+
round-trips token→NodeRow→token with no geometry change.
115+
- **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known
116+
  OCR-garbage fixture (`69B8`, `rn` → `m`) is repaired by plausibility.
117+
- **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the
118+
SoA migration suite. **Prereq: D-OCR-50 + D-OCR-51** (class/HHTL/ValueSchema must
119+
define the row layout before bytes can be golden-diffed).
120+
121+
## 8. Open decisions
122+
123+
- **OD-1:** dedicated `ValueTenant::OcrEvidence` vs ride `Meta`+`HelixResidue` (POC rides).
124+
- **OD-50a:** is a "Token" one node, or is a "Line" the node and tokens are value-slab
125+
sub-records? (Node-per-token is simpler + edges are natural; node-per-line is denser.)
126+
- **OD-52a:** character-confusion layer as a `deepnsm` submodule vs a sibling
127+
`coca-codebook` crate. (Word-frequency half already lives in `deepnsm/word_frequency`.)
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
1+
# SoA Centroid Attention Field — Unified Synthesis v1
2+
3+
> **Type:** plan (phase-2 marker / co-architecture). Unifies recognition + reasoning + grammar as reads of ONE field.
4+
> **Status:** PLANTED 2026-06-15. Gated on `cycle-coherent-soa-snapshot-v1` (plastic field ⇒ COW writes).
5+
> **Canon:** helix crate (golden-index residue, φ-template); deepnsm; causal-edge (pearl/nars); TEKAMOLO (#495).
6+
7+
---
8+
9+
## 0. The one idea
10+
11+
The **48-bit helix residue + Morton-tile stacked-pyramid perturbation-shader IS a
12+
centroid attention field.** Place (HHTL) = centroid; residue (24-bit golden index)
13+
= each point's perturbation off it = the **query↔key alignment**; the pyramid =
14+
**multi-scale attention** (coarse centroid → fine). The field is *evaluated from the
15+
φ-spiral template, never stored*. Everything below is a **read of this one field at
16+
a different scale** — not separate engines bolted together.
17+
18+
## 1. The reads (each is the same field, different scale)
19+
20+
| Capability | Real crate / source | What it is, as a field read |
|---|---|---|
| **Perception (ONNX/LSTM)** | embedanything(candle)/GGUF host | emits a **query** into the field (golden index + posteriors); the ONLY learned-perceptual part, stays hosted |
| **Attention eval** | `helix` (golden index, curve-ruler, `DistanceLut`) | query↔centroid alignment; Morton pyramid = coarse→fine resolution |
| **Markov context building / bundling** | `deepnsm::markov_bundle`, `encoder` | temporal **superposition along the field** = the bundling read (context = bundled perturbations) |
| **Quorum + NARS reasoning** | `causal-edge::{pearl,nars,syllogism}` | centroid **coupling** = edge read; quorum = agreement of multiple field reads; NARS truth = coupling strength |
| **Grammar heuristics** | `deepnsm::{parser,pos,morphology,spo,syllogism}` | syntactic **field masks** = structured attention over the field |
| **Relative-pronoun / syntax order** | TEKAMOLO resolver (#495) | resolves adverbial/relative-pronoun binding = constrained attention path |
| **Rule learning (the real "aerial")** | `lance-graph-arm-discovery::aerial`: Aerial+ transcode (arXiv 2504.19354), **autoencoder replaced by integer codebook-distance oracle** (palette256, ρ=0.9973 vs cosine) | mines SPO association rules **float-free / bitwise-deterministic** → `arm_to_truth_u8` → `CausalEdge64` confidence_u8 + i4 mantissa. This IS "learning edges": the field's codebook distance replaces the f32 autoencoder. |
| **Episodic / coref** | AriGraph (`EpisodicWitness64`) | temporal chain read = the field over witness-time |
| **Nearest-valid-token** | `crystal_neighborhood`, `cam64`, CAKES + `turbovec` | field-alignment argmax = read-off to codebook word |
31+
32+
## 2. Why this is one object, not a pipeline
33+
34+
VSA bind/bundle/similarity **are** the field operations: bind = perturbation off
35+
centroid, bundle = the pyramid's coarse-level superposition, similarity = field
36+
alignment (`DistanceLut`). So DeepNSM's markov_bundle is the *symbolic readout* of
37+
the field; NARS/quorum is the *edge coupling*; grammar/TEKAMOLO are *attention
38+
masks*. No separate learning machine is needed — the attention field already does
39+
binding/bundling/attention in one structure (Frady/Kleyko 1707.01429: trained-RNN
40+
⊁ VSA for symbol sequences). **`aerial` is the proof in-tree:** Aerial+'s f32
41+
autoencoder is replaced by the integer codebook-distance oracle (the field) and
42+
still mines rules — neurosymbolic learning with NO autoencoder, NO SGD, NO seed.
43+
What's missing is only **plasticity** (centroid drift), not a learner.
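For readers unfamiliar with the three VSA operations, here is a generic binary (XOR/majority/Hamming) sketch. It is a textbook binary-VSA illustration over `u64` words, chosen for brevity; it is an assumption for exposition and not the helix golden-index field or `DistanceLut`.

```rust
/// Bind: XOR is its own inverse, so binding twice with the same key recovers the input.
pub fn bind(a: u64, b: u64) -> u64 {
    a ^ b
}

/// Bundle: per-bit strict majority vote across the inputs (superposition).
pub fn bundle(items: &[u64]) -> u64 {
    let mut out = 0u64;
    for bit in 0..64 {
        let ones = items.iter().filter(|&&v| (v >> bit) & 1 == 1).count();
        if ones * 2 > items.len() {
            out |= 1u64 << bit;
        }
    }
    out
}

/// Similarity: number of agreeing bits (64 minus the Hamming distance).
pub fn similarity(a: u64, b: u64) -> u32 {
    64 - (a ^ b).count_ones()
}
```

A bundled vector stays closer to each of its members than to an unrelated vector, which is the property the "context = bundled perturbations" reading relies on.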
44+
45+
## 3. Phase-2: make the field plastic (the "learning edges")
46+
47+
Not new tenants — **the field adapts**:
48+
- centroid **drift** (place-centroids move toward corpus density);
49+
- shader **perturbation-gain** adaptation (the pyramid's response sharpens);
50+
- timed by `Plasticity` tenant; coupled by `CausalEdge64` strength (NARS mantissa moves).
51+
Evaluated from the φ-template (not materialized). **Hard dep:** `cycle-coherent-soa-snapshot`
52+
COW — plastic field mutates per cycle; without snapshot it thrashes Lance.
53+
54+
## 4. ONNX combination (operator's point)
55+
56+
The ONNX-shaped recognizer and the field **meet at the query boundary**: ONNX emits
57+
posteriors → the field's golden-index query; field eval + grammar masks + NARS
58+
coupling resolve to the token. So ONNX = the perceptual *encoder into* the field;
59+
the field = everything symbolic/sequential/relational. One substrate, two scales.
60+
61+
## 5. Determinism split (non-negotiable)
62+
- **Frozen mode** (centroids/gains fixed) → bit-reproducible → the Tesseract oracle + golden-file harness run here.
63+
- **Plastic mode** (field adapts) → live use; NOT golden-diffable; gated by snapshot.
64+
Two modes, explicitly separated, or the bit-repro guarantee is lost.
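One way to keep the two modes explicitly separated is to make the mode a value the harness must inspect before diffing. The names below (`FieldMode`, `golden_diff_allowed`) are hypothetical; this is a sketch of the guard, not an existing API.

```rust
/// The two operating modes from the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldMode {
    /// Centroids and gains fixed: output is expected to be bit-reproducible.
    Frozen,
    /// Field adapts per cycle: output is not golden-diffable.
    Plastic,
}

/// Gate for the golden-file harness: only frozen runs may be compared byte-for-byte.
pub fn golden_diff_allowed(mode: FieldMode) -> Result<(), &'static str> {
    match mode {
        FieldMode::Frozen => Ok(()),
        FieldMode::Plastic => Err("plastic mode is not golden-diffable; freeze centroids and gains first"),
    }
}
```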
65+
66+
## 6. Open
67+
- **OD-A:** RESOLVED — "aerial" = `lance-graph-arm-discovery::aerial` (Aerial+ rule-mining, codebook-distance oracle, not AriGraph).
68+
- **OD-B:** centroid drift rule — Hebbian on `Plasticity`, or NARS-revision on `CausalEdge64`? (probe-gate, measure first.)
69+
- **OD-C:** operator sign-off required for any new tenant (anti-invention guardrail) — phase-2 should need NONE (field is evaluated, not stored).
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
1+
# tesseract-rs — AST-DLL C++→Rust Codegen Harness v1
2+
3+
> **Type:** plan (sub-plan). Deliverables D-OCR-40/41/42. The transcode *mechanism*.
4+
> **Status:** PLANTED 2026-06-15 v2 — layout IS in scope (1:1 raw-pointer), not skipped.
5+
> **Front:** post-#496. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine.
6+
> **Canon anchors:** master §4. Deterministic + diff-gated (bit-reproducibility doctrine).
7+
> **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is transcribed faithfully as raw-pointer Rust (1:1), with safe-refactor deferred to a later oracle-gated pass.
8+
9+
---
10+
11+
## 0. Intent
12+
13+
Transcode the *mechanical* C++ leaf modules (container parse, unicharset, recoder,
14+
dawg node-arrays, weight-matrix struct walks) into Rust by a **deterministic,
15+
reviewable codegen harness** rather than by hand — so the faithful tier is
16+
auditable and re-runnable. The harness pairs a **clang C++ AST frontend** with a
17+
**Rust emission backend built on the `ruff` AST/codegen crates**.
18+
19+
## 1. Why ruff (honest scoping)
20+
21+
`ruff` is a *Python* toolchain — `ruff_python_parser` / `ruff_python_ast` parse
22+
**Python**, not C++. So ruff is **not** the C++ frontend. Its value here is the
23+
mature, battle-tested **Rust-side AST → source emission discipline**:
24+
`ruff_python_codegen` (AST → formatted source), `ruff_formatter` (the formatting
25+
IR), `ruff_source_file`, and the `ruff_python_dto_check` pattern (structural
26+
invariant checks on a typed AST). We reuse those *patterns and crates* as the
27+
emission/formatting backend for a `RustAst → rust source` pipeline. The C++ side is
28+
clang.
29+
30+
```
C++ source ──(libclang)──► Clang AST ──► [AST DLL: stable IR dump] ──► RustAst builder
   ──► (ruff codegen/formatter discipline) ──► formatted .rs ──► diff-gate vs FFI oracle
```
34+
35+
## 2. The "AST DLL" — D-OCR-40
36+
37+
The C++ AST is extracted once into a **stable, serializable IR** (the "AST DLL"):
38+
a libclang traversal that dumps the subset we transcode (struct/enum decls, plain
39+
methods, table initializers, fixed-size array walks) as a typed IR — independent of
40+
clang version drift, so the emission step is reproducible. Functions touching
41+
pointers-into-mutable-graphs, virtual dispatch, or template metaprogramming are
42+
**flagged NOT-CODEGENABLE** and routed to hand-port or, for layout code, to the
faithful 1:1 raw-pointer transcription (D-OCR-30, §5); per the v2 status above,
layout is no longer skipped.
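The flagging rule can be sketched as a routing function over per-function facts. All names here (`Blocker`, `FnFacts`, `Route`, `route`) are hypothetical illustrations of the IR, not the harness's actual types; how the facts are derived from the Clang AST is out of scope for the sketch.

```rust
/// Reasons the plan lists for keeping a function out of mechanical codegen.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Blocker {
    PointerIntoMutableGraph,
    VirtualDispatch,
    TemplateMetaprogramming,
}

/// Hypothetical per-function record an AST walk might emit into the IR.
#[derive(Debug, Clone)]
pub struct FnFacts {
    pub name: String,
    pub blockers: Vec<Blocker>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Route {
    Codegen,
    NotCodegenable(Vec<Blocker>),
}

/// A function is a codegen target only when no blocker was recorded for it.
pub fn route(f: &FnFacts) -> Route {
    if f.blockers.is_empty() {
        Route::Codegen
    } else {
        Route::NotCodegenable(f.blockers.clone())
    }
}
```

Keeping the blocker list in the route makes the flag reviewable: the IR dump records why a function left the mechanical path, not just that it did.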
44+
45+
## 3. Rust emission via ruff crates — D-OCR-41
46+
47+
A `RustAst` builder consumes the IR and emits idiomatic Rust:
48+
- field-by-field struct/enum transcription (canon: byte layout preserved);
49+
- table/array initializers → `const`/`static` Rust tables;
50+
- the emission goes through ruff's formatter IR so output is deterministic and
51+
diff-stable (re-running codegen produces byte-identical source).
52+
- a `dto_check`-style pass asserts the **LE byte contract** is preserved per struct
53+
(no silent re-ordering / re-widening — the same invariant the SoA envelope audit
54+
enforces).
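A `dto_check`-style structural assertion could look like the following. This is a sketch under assumptions: `RecordSketch` is an invented `#[repr(C)]` struct (not a Tesseract type), and `ExpectedLayout` stands in for whatever form the C++ `sizeof`/offset dump takes.

```rust
/// Hypothetical transcribed struct; NOT a real Tesseract type.
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct RecordSketch {
    pub code: u16,
    pub flags: u8,
    pub pad: u8,
    pub id: u32,
}

/// Layout facts as they might arrive from a C++ `sizeof`/`offsetof` dump.
pub struct ExpectedLayout<'a> {
    pub size: usize,
    pub offsets: &'a [(&'a str, usize)],
}

/// Measure the Rust-side field offsets from a live instance.
pub fn measured_offsets(r: &RecordSketch) -> Vec<(&'static str, usize)> {
    let base = r as *const RecordSketch as usize;
    vec![
        ("code", &r.code as *const u16 as usize - base),
        ("flags", &r.flags as *const u8 as usize - base),
        ("pad", &r.pad as *const u8 as usize - base),
        ("id", &r.id as *const u32 as usize - base),
    ]
}

/// Fail on any size or offset disagreement (silent re-ordering or re-widening).
pub fn check_layout(expected: &ExpectedLayout) -> Result<(), String> {
    let size = std::mem::size_of::<RecordSketch>();
    if size != expected.size {
        return Err(format!("size mismatch: rust {} vs c++ {}", size, expected.size));
    }
    let measured = measured_offsets(&RecordSketch::default());
    for (name, want) in expected.offsets {
        match measured.iter().find(|(n, _)| n == name) {
            Some((_, got)) if got == want => {}
            Some((_, got)) => return Err(format!("{}: offset {} vs {}", name, got, want)),
            None => return Err(format!("{}: field missing", name)),
        }
    }
    Ok(())
}
```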
55+
56+
## 4. Diff-gate — D-OCR-42
57+
58+
Every codegen'd module is validated against the FFI oracle:
59+
- behavioral: emitted Rust function vs `libtesseract` function on the same inputs
60+
(e.g. unicharset id↔utf8, recoder encode/decode, dawg word-membership) → byte-equal;
61+
- structural: `dto_check` confirms each emitted struct's byte image matches the C++
62+
`sizeof`/offset dump.
63+
Codegen output is committed (not generated at build) so reviewers see real Rust;
64+
the harness is re-runnable to prove the commit equals the generator output.
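The behavioral half of the gate reduces to running both implementations over the same inputs and demanding byte equality. In the sketch below `diff_gate` is a hypothetical helper and both sides are ordinary closures; in the real harness the oracle side would presumably wrap a `libtesseract` FFI call, which is not shown.

```rust
/// Run the emitted port and the oracle over the same inputs.
/// Returns `Err(index)` for the first input whose outputs are not byte-equal.
pub fn diff_gate<I, P, O>(inputs: &[I], port: P, oracle: O) -> Result<(), usize>
where
    P: Fn(&I) -> Vec<u8>,
    O: Fn(&I) -> Vec<u8>,
{
    for (i, input) in inputs.iter().enumerate() {
        if port(input) != oracle(input) {
            return Err(i);
        }
    }
    Ok(())
}
```

Returning the failing input index keeps the gate usable as a fixture minimizer: the first divergent input is usually the one worth committing as a regression case.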
65+
66+
## 5. Module assignment (codegen vs hand vs replace)
67+
68+
| C++ area | Route |
|---|---|
| `tessdatamanager`, `unicharset`, `unicharcompress` (recoder), `dawg`/`trie` node arrays, `weightmatrix` struct/quant walks | **CODEGEN (D-OCR-41)** |
| `recodebeam` (beam + dawg interaction), int8 GEMV rounding | **HAND-PORT** (numeric/behavioral subtlety) |
| `textord`/`ccstruct` layout | **CODEGEN → faithful raw-pointer Rust (D-OCR-30)**: intrusive ELIST/CLIST transcribed 1:1, NOT replaced |
| Leptonica (~dozen ops only) | hand-port to image/imageproc (D-OCR-31) |
74+
75+
## 6. Deliverables
76+
77+
- **D-OCR-40:** libclang → stable IR dump for the codegen-target module set; NOT-CODEGENABLE flagging works.
78+
- **D-OCR-41:** IR → committed Rust via ruff emission; re-run is byte-identical.
79+
- **D-OCR-42:** behavioral + structural diff-gate green for the target modules vs the FFI oracle.
80+
81+
## 7. Open decisions
82+
83+
- **OD-3 (from master):** libclang in-process vs clang `-ast-dump=json` consumed by
84+
a Rust IR. JSON is simpler/decoupled; libclang is richer/faster. Default: clang
85+
JSON dump for v1 (decoupled, reproducible), libclang later if needed.
86+
- **OD-40a:** is the AST-DLL harness OCR-specific, or a reusable
87+
`AdaWorldAPI/<cpp-transcode>` tool? (It would also serve other C++→Rust ports.)
