AdaWorldAPI · AdaWorldAPI · Jun 16, 2026 · Jun 15, 2026 · Jun 15, 2026 · Jun 15, 2026
diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md
@@ -0,0 +1,119 @@
+# OCR → Canonical SoA Integration v1
+
+> **Type:** plan (sub-plan — the one that binds OCR to the lance-graph substrate). Deliverables D-OCR-50/51/52/53.
+> **Status:** PLANTED 2026-06-15 — design only. THIS is "use the new architecture we raced for."
+> **Front:** post-#496. Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`).
+> **Canon anchors:** OGAR/CLAUDE.md P0 GUID; lance-graph/CLAUDE.md SoA node (`4ea6ac9`); soa-three-tier-model; DeepNSM crate (`lance-graph/crates/deepnsm`); helix/CAM-PQ (`crates/helix`, `bgz-tensor` CAM-PQ).
+> **Skip-by-rule:** OCR introduces NO bespoke row geometry. It rides the existing value-tenant carve.
+
+---
+
+## 0. Intent
+
+An OCR token is not a foreign payload that needs a boundary adapter — it **is** a
+canonical SoA node. This plan defines the mapping so recognized text lands directly
+in the substrate: addressed by HHTL, classed by OGAR, valued by a `ValueSchema`
+preset over *existing* `ValueTenant`s, edged by `EdgeCodecFlavor`, repaired by
+DeepNSM + CAM/PQ, and persisted via `NodeRowPacket`. Zero boundary tax — the whole
+point of the splat-native / "one representation, many views" doctrine, applied to OCR.
+
+## 1. OCR token → `NodeRow` mapping (D-OCR-50)
+
+**Key (`NodeGuid`, 16 B):** `classid · HEEL · HIP · TWIG · family · identity`.
+- `classid` = the minted OCR class prefix (see §2). `0x0000_0000` fallback until OGAR mints it.
+- HHTL path (HEEL/HIP/TWIG) = document → page → block (the layout hierarchy from `ocrs::layout_analysis`).
+- `family` (3 B) = line/region basin; `identity` (3 B) = token ordinal within the basin.
+  → `local_key()` (trailing 6 B) addresses a token within its line after the trie walk.
+
+**Edges (`EdgeBlock`, 16 B = 12 in-family + 4 out-of-family):**
+- in-family (12): reading-order + local-layout adjacency (prev/next token, same-line
+  neighbors, baseline siblings). `EdgeCodecFlavor::CoarseOnly` (1 B/slot) — pure topology.
+- out-of-family (4): inherited adapters — (A) table-cell membership, (B) block/column
+  parent, (C) semantic/coref link (post-DeepNSM), (D) source-region (bbox → page geometry).
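The key layout in §1 can be sketched as a plain 16-byte array. This is a minimal illustration, not the `NodeGuid` in `canonical_node.rs`: the type name `OcrGuid`, the little-endian packing, and the 2-byte width of each HHTL level (derived from 16 − 4 − 3 − 3 = 6 bytes) are all assumptions.

```rust
/// Hypothetical 16-byte key mirroring the field order in the plan:
/// classid (4 B) · HEEL (2 B) · HIP (2 B) · TWIG (2 B) · family (3 B) · identity (3 B).
/// The 2-byte HHTL widths and the little-endian packing are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OcrGuid([u8; 16]);

impl OcrGuid {
    pub fn new(classid: u32, heel: u16, hip: u16, twig: u16, family: u32, identity: u32) -> Self {
        // family and identity are 3-byte fields, so they must fit in 24 bits.
        assert!(family < (1 << 24) && identity < (1 << 24), "3-byte field overflow");
        let mut b = [0u8; 16];
        b[0..4].copy_from_slice(&classid.to_le_bytes());
        b[4..6].copy_from_slice(&heel.to_le_bytes());
        b[6..8].copy_from_slice(&hip.to_le_bytes());
        b[8..10].copy_from_slice(&twig.to_le_bytes());
        b[10..13].copy_from_slice(&family.to_le_bytes()[..3]);
        b[13..16].copy_from_slice(&identity.to_le_bytes()[..3]);
        OcrGuid(b)
    }

    /// Trailing 6 bytes (family + identity): addresses a token within its line.
    pub fn local_key(&self) -> [u8; 6] {
        let mut k = [0u8; 6];
        k.copy_from_slice(&self.0[10..16]);
        k
    }

    pub fn as_bytes(&self) -> &[u8; 16] {
        &self.0
    }
}
```

With the fallback classid `0x0000_0000`, `OcrGuid::new(0, 1, 2, 3, 7, 42)` would describe token 42 of line basin 7 in document 1, page 2, block 3.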
+
+## 2. OCR class + HHTL address scheme (D-OCR-50)
+
+- Mint an OCR class family in OGAR (`ogar-ontology`): `Document → Page → Block →
+  Line → Token`, with leaf token subtypes (`Word`, `Number`, `Date`, `Currency`,
+  `Glyph`, `TableCell`). Until OGAR mints them, hardcode the classid prefix space
+  per the reserve-don't-reclaim ladder (the classid bytes stay reserved at offset 0).
+- `ClassView` for the OCR class declares `edge_codec_flavor` (`CoarseOnly`) and
+  `value_schema` (the OCR preset, §3).
+
+## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51)
+
+The 480-byte value slab already carves into `VALUE_TENANTS`. OCR **rides existing
+tenants** — no new tenant for the POC:
+
+| Tenant (existing) | OCR use |
+|---|---|
+| `Fingerprint` (32 B / 256-bit) | glyph/line identity print (DeepNSM `encoder` XOR-bind/bundle of the crop) |
+| `TurbovecResidue` (16 B, PQ) | glyph embedding → CAKES nearest-valid-token search |
+| `HelixResidue` (48 B) | orthogonal residue: per-token deviation from class centroid (confidence-as-residue) |
+| `Meta` (u64) | packed confidence + NSM-repair flags + token-subtype bits |
+| `EntityType` (u16) | OCR token class discriminator (Word/Number/Date/Currency/Glyph/TableCell) |
+| `Plasticity` (u32) | correction history / last-repair stamp |
+
+→ define `ValueSchema::Ocr` (or select `Cognitive` if its mask already covers the
+above) as a `FieldMask` over those `ValueTenant` positions. Selection only — it
+carves *within* the slab, moves nothing (canon: tenants never move/reuse).
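A selection-only preset can be pictured as a bitmask over tenant positions. The sketch below is illustrative: `Tenant`, `Mask`, `OCR_PRESET`, and the bit positions are hypothetical stand-ins for `ValueTenant`, `FieldMask`, and the real `VALUE_TENANTS` order, which may differ. The byte widths are the ones listed in the table above.

```rust
/// Hypothetical tenant positions; the real order lives in `canonical_node.rs`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Tenant {
    Fingerprint = 0,
    TurbovecResidue = 1,
    HelixResidue = 2,
    Meta = 3,
    EntityType = 4,
    Plasticity = 5,
}

impl Tenant {
    /// Byte widths as listed in the plan's tenant table.
    pub const fn width(self) -> usize {
        match self {
            Tenant::Fingerprint => 32,
            Tenant::TurbovecResidue => 16,
            Tenant::HelixResidue => 48,
            Tenant::Meta => 8,
            Tenant::EntityType => 2,
            Tenant::Plasticity => 4,
        }
    }
}

/// A selection over tenants: one bit per position, no bytes are moved.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Mask(u32);

impl Mask {
    pub const EMPTY: Mask = Mask(0);

    pub const fn with(self, t: Tenant) -> Mask {
        Mask(self.0 | (1u32 << (t as u32)))
    }

    pub const fn contains(self, t: Tenant) -> bool {
        self.0 & (1u32 << (t as u32)) != 0
    }
}

pub const ALL_TENANTS: [Tenant; 6] = [
    Tenant::Fingerprint,
    Tenant::TurbovecResidue,
    Tenant::HelixResidue,
    Tenant::Meta,
    Tenant::EntityType,
    Tenant::Plasticity,
];

/// The OCR preset selects all six tenants from the table.
pub const OCR_PRESET: Mask = Mask::EMPTY
    .with(Tenant::Fingerprint)
    .with(Tenant::TurbovecResidue)
    .with(Tenant::HelixResidue)
    .with(Tenant::Meta)
    .with(Tenant::EntityType)
    .with(Tenant::Plasticity);

/// Total bytes of the slab that a mask selects.
pub fn selected_bytes(mask: Mask) -> usize {
    ALL_TENANTS
        .iter()
        .filter(|t| mask.contains(**t))
        .map(|t| t.width())
        .sum()
}
```

Under these widths the preset selects 110 of the 480 slab bytes, which is why the POC needs no new tenant.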
+
+**OD-1 (deferred):** a dedicated `ValueTenant::OcrEvidence` (bbox `[f16;4]` +
+per-char confidence + top-k recodebeam candidates) is the clean home for
+recognizer evidence. Adding a tenant is canon-significant, so the POC packs a
+compressed form into `Meta`+`HelixResidue` and defers the dedicated tenant to a
+follow-up once the evidence shape is stable (needs D-OCR-21 `lstm_choice_mode`).
+
+## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52)
+
+The recognizer emits candidates+confidence; repair is the brainstem we already have:
+- **Character/orthographic layer (new, thin, below DeepNSM):** `0/O 1/I/l 5/S rn/m`
+  confusion table + number/date/currency/table-cell grammars. Repairs orthography on
+  OOV garbage (codes, IDs like `69B8`) BEFORE the word layer. (This is the only
+  genuinely greenfield code; the word-frequency half already exists as
+  `deepnsm/word_frequency`.)
+- **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder`
+  → `similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation.
+- **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue`
+  (PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA flags anomalous
+  tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`.
+
+Repaired token writes back: corrected text → `Fingerprint`/`EntityType`, repair
+provenance → `Meta`/`Plasticity`.
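The character/orthographic layer can be sketched as a breadth-first search over confusion substitutions, accepting the first candidate that a validity check admits. This is a toy under stated assumptions: `repair`, `one_edit_variants`, and the `valid` set are hypothetical names, and the set stands in for whatever grammar or `deepnsm` vocabulary lookup the real layer would consult.

```rust
use std::collections::HashSet;

/// Confusion pairs from the plan (`0/O 1/I/l 5/S rn/m`), applied in both directions.
const CONFUSIONS: &[(&str, &str)] = &[
    ("0", "O"),
    ("1", "I"),
    ("1", "l"),
    ("I", "l"),
    ("5", "S"),
    ("rn", "m"),
];

/// All strings reachable from `token` by exactly one confusion substitution.
fn one_edit_variants(token: &str) -> Vec<String> {
    let mut out = Vec::new();
    for &(a, b) in CONFUSIONS {
        for (from, to) in [(a, b), (b, a)] {
            for (idx, _) in token.match_indices(from) {
                let mut s = String::with_capacity(token.len() + to.len());
                s.push_str(&token[..idx]);
                s.push_str(to);
                s.push_str(&token[idx + from.len()..]);
                out.push(s);
            }
        }
    }
    out
}

/// Returns the token itself if valid, otherwise the first valid variant within
/// `max_edits` substitutions. Breadth-first, so fewer substitutions win.
pub fn repair(token: &str, valid: &HashSet<&str>, max_edits: usize) -> Option<String> {
    let mut frontier = vec![token.to_string()];
    let mut seen: HashSet<String> = frontier.iter().cloned().collect();
    for _ in 0..=max_edits {
        if let Some(hit) = frontier.iter().find(|s| valid.contains(s.as_str())) {
            return Some(hit.clone());
        }
        // Expand every frontier string by one more substitution.
        let mut next = Vec::new();
        for s in &frontier {
            for v in one_edit_variants(s) {
                if seen.insert(v.clone()) {
                    next.push(v);
                }
            }
        }
        frontier = next;
    }
    None
}
```

A token that no substitution chain maps into the valid set comes back as `None`, which is the point at which the word layer or CAKES search would take over.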
+
+## 5. Persistence + planner (kv-lance / surreal)
+
+- `NodeRowPacket` → `SoaEnvelope` → Lance (kv-lance backend, per `surrealdb` fork).
+  OCR nodes are ordinary rows; a Lance version is a coherent page/document snapshot.
+- `surreal_container` as the **OCR-job control plane** (per its role: planner / AST
+  adapter / time-series / kanban): kanban of OCR jobs (queued→detect→recognize→
+  repair→persisted via the Rubicon transitions already in `soa_view.rs`), time-series
+  of throughput, AST API for the repair-grammar (compile-time vs JIT grammars).
+
+## 6. Bit-reproducibility harness (D-OCR-53) — the migration payoff
+
+The transcode oracle (D-OCR-2x) makes OCR a **deterministic regression source for
+the whole SoA migration**: the same line crop → C++ Tesseract text AND Rust port
+text AND the resulting `NodeRow` bytes. Because every stage is supposed to be
+bit-reproducible (DeepNSM bit-reproducible, envelope version-stamped, CausalEdge64
+locked), a golden-file diff over (crop → NodeRow) exercises exactly the muscles the
+migration must harden: `ndarray::hpc` hydration, the envelope LE round-trip, and
+SIMD numeric exactness. OCR is the best external oracle the substrate has.
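The core of a golden-file gate is a byte comparison that reports where two images first diverge. A minimal sketch, assuming the `NodeRow` bytes are already available as slices; `first_mismatch` and `check_golden` are hypothetical helper names, not part of the existing suite.

```rust
/// First byte offset at which two images differ, or `None` if identical.
/// A length mismatch counts as a difference at the shorter length.
pub fn first_mismatch(golden: &[u8], actual: &[u8]) -> Option<usize> {
    let common = golden.len().min(actual.len());
    (0..common)
        .find(|&i| golden[i] != actual[i])
        .or(if golden.len() != actual.len() { Some(common) } else { None })
}

/// Wraps the comparison in a message that names the fixture and the offset.
pub fn check_golden(name: &str, golden: &[u8], actual: &[u8]) -> Result<(), String> {
    match first_mismatch(golden, actual) {
        None => Ok(()),
        Some(off) => Err(format!(
            "{}: bytes diverge at offset {} (golden len {}, actual len {})",
            name,
            off,
            golden.len(),
            actual.len()
        )),
    }
}
```

Reporting the offset rather than a bare pass/fail lets a failure be mapped back to the tenant whose bytes drifted.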
+
+## 7. Deliverables
+
+- **D-OCR-50:** OCR class + HHTL address scheme; `ClassView` impl for OCR class.
+- **D-OCR-51:** `ValueSchema` OCR preset (FieldMask over existing tenants); a token
+  round-trips token→NodeRow→token with no geometry change.
+- **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known
+  OCR-garbage fixture (`69B8`, `rn`→`m`) is repaired by plausibility.
+- **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the
+  SoA migration suite.
+
+## 8. Open decisions
+
+- **OD-1:** dedicated `ValueTenant::OcrEvidence` vs ride `Meta`+`HelixResidue` (POC rides).
+- **OD-50a:** is a "Token" one node, or is a "Line" the node and tokens are value-slab
+  sub-records? (Node-per-token is simpler + edges are natural; node-per-line is denser.)
+- **OD-52a:** character-confusion layer as a `deepnsm` submodule vs a sibling
+  `coca-codebook` crate. (Word-frequency half already lives in `deepnsm/word_frequency`.)
diff --git a/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md b/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md
@@ -0,0 +1,86 @@
+# tesseract-rs — AST-DLL C++→Rust Codegen Harness v1
+
+> **Type:** plan (sub-plan). Deliverables D-OCR-40/41/42. The transcode *mechanism*.
+> **Status:** PLANTED 2026-06-15 — design only.
+> **Front:** post-#496. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine.
+> **Canon anchors:** master §4. Deterministic + diff-gated (bit-reproducibility doctrine).
+> **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is hand-ported or replaced.
+
+---
+
+## 0. Intent
+
+Transcode the *mechanical* C++ leaf modules (container parse, unicharset, recoder,
+dawg node-arrays, weight-matrix struct walks) into Rust by a **deterministic,
+reviewable codegen harness** rather than by hand — so the faithful tier is
+auditable and re-runnable. The harness pairs a **clang C++ AST frontend** with a
+**Rust emission backend built on the `ruff` AST/codegen crates**.
+
+## 1. Why ruff (honest scoping)
+
+`ruff` is a *Python* toolchain — `ruff_python_parser` / `ruff_python_ast` parse
+**Python**, not C++. So ruff is **not** the C++ frontend. Its value here is the
+mature, battle-tested **AST → source emission discipline** of its Rust crates:
+`ruff_python_codegen` (Python AST → Python source), `ruff_formatter` (the formatting
+IR), `ruff_source_file`, and the `ruff_python_dto_check` pattern (structural
+invariant checks on a typed AST). We reuse those *patterns and crates* as the
+emission/formatting backend for a `RustAst → rust source` pipeline. The C++ side is
+clang.
+
+```
+C++ source ──(libclang)──► Clang AST ──► [AST DLL: stable IR dump] ──► RustAst builder
+   ──► (ruff codegen/formatter discipline) ──► formatted .rs ──► diff-gate vs FFI oracle
+```
+
+## 2. The "AST DLL" — D-OCR-40
+
+The C++ AST is extracted once into a **stable, serializable IR** (the "AST DLL"):
+a libclang traversal that dumps the subset we transcode (struct/enum decls, plain
+methods, table initializers, fixed-size array walks) as a typed IR — independent of
+clang version drift, so the emission step is reproducible. Functions touching
+pointers-into-mutable-graphs, virtual dispatch, or template metaprogramming are
+**flagged NOT-CODEGENABLE** and routed to hand-port/replace (see §5; the layout
+code among them is already skipped, per master §3).
+
+## 3. Rust emission via ruff crates — D-OCR-41
+
+A `RustAst` builder consumes the IR and emits idiomatic Rust:
+- field-by-field struct/enum transcription (canon: byte layout preserved);
+- table/array initializers → `const`/`static` Rust tables;
+- the emission goes through ruff's formatter IR so output is deterministic and
+  diff-stable (re-running codegen produces byte-identical source);
+- a `dto_check`-style pass asserts the **LE byte contract** is preserved per struct
+  (no silent re-ordering / re-widening — the same invariant the SoA envelope audit
+  enforces).
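A `dto_check`-style structural assertion can be sketched with plain `#[repr(C)]` and measured field offsets. The struct `TableEntry` and the expected numbers are hypothetical and do not describe any real Tesseract type; in the harness the expected table would come from the C++ `sizeof`/offset dump.

```rust
/// Hypothetical transcribed struct; `repr(C)` keeps declaration order and C layout rules.
#[repr(C)]
#[derive(Default)]
pub struct TableEntry {
    pub offset: u64,
    pub length: u32,
    pub kind: u16,
    pub flags: u16,
}

/// Byte offset of a field, measured on a live value.
macro_rules! field_offset {
    ($val:expr, $field:ident) => {
        (&$val.$field as *const _ as usize) - (&$val as *const _ as usize)
    };
}

/// The Rust side of the layout: field offsets plus total size.
pub fn layout_of_entry() -> Vec<(&'static str, usize)> {
    let e = TableEntry::default();
    vec![
        ("offset", field_offset!(e, offset)),
        ("length", field_offset!(e, length)),
        ("kind", field_offset!(e, kind)),
        ("flags", field_offset!(e, flags)),
        ("sizeof", std::mem::size_of::<TableEntry>()),
    ]
}

/// Compares the Rust layout against an expected (name, value) table.
pub fn check_layout(expected: &[(&str, usize)]) -> Result<(), String> {
    let actual = layout_of_entry();
    for &(name, want) in expected {
        match actual.iter().find(|(n, _)| *n == name) {
            Some(&(_, got)) if got == want => {}
            Some(&(_, got)) => {
                return Err(format!("{}: expected {}, got {}", name, want, got))
            }
            None => return Err(format!("{}: not present in Rust struct", name)),
        }
    }
    Ok(())
}
```

A silent re-ordering or re-widening of a field changes at least one offset or the total size, so the check fails at the first divergent entry.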
+
+## 4. Diff-gate — D-OCR-42
+
+Every codegen'd module is validated against the FFI oracle:
+- behavioral: emitted Rust function vs `libtesseract` function on the same inputs
+  (e.g. unicharset id↔utf8, recoder encode/decode, dawg word-membership) → byte-equal;
+- structural: `dto_check` confirms each emitted struct's byte image matches the C++
+  `sizeof`/offset dump.
+
+Codegen output is committed (not generated at build) so reviewers see real Rust;
+the harness is re-runnable to prove the commit equals the generator output.
+
+## 5. Module assignment (codegen vs hand vs replace)
+
+| C++ area | Route |
+|---|---|
+| `tessdatamanager`, `unicharset`, `unicharcompress` (recoder), `dawg`/`trie` node arrays, `weightmatrix` struct/quant walks | **CODEGEN (D-OCR-41)** |
+| `recodebeam` (beam + dawg interaction), int8 GEMV rounding | **HAND-PORT** (numeric/behavioral subtlety) |
+| `textord`/`ccstruct` layout, Leptonica | **REPLACE** (ocrs / minimal imageproc) — never enters the harness |
+
+## 6. Deliverables
+
+- **D-OCR-40:** libclang → stable IR dump for the codegen-target module set; NOT-CODEGENABLE flagging works.
+- **D-OCR-41:** IR → committed Rust via ruff emission; re-run is byte-identical.
+- **D-OCR-42:** behavioral + structural diff-gate green for the target modules vs the FFI oracle.
+
+## 7. Open decisions
+
+- **OD-3 (from master):** libclang in-process vs clang `-Xclang -ast-dump=json` consumed by
+  a Rust IR. JSON is simpler/decoupled; libclang is richer/faster. Default: clang
+  JSON dump for v1 (decoupled, reproducible), libclang later if needed.
+- **OD-40a:** is the AST-DLL harness OCR-specific, or a reusable
+  `AdaWorldAPI/<cpp-transcode>` tool? (It would also serve other C++→Rust ports.)
diff --git a/.claude/plans/tesseract-rs-lstm-recodebeam-v1.md b/.claude/plans/tesseract-rs-lstm-recodebeam-v1.md
@@ -0,0 +1,79 @@
+# tesseract-rs — LSTM Forward + recodebeam Decoder v1
+
+> **Type:** plan (sub-plan). Deliverables D-OCR-20/21/22.
+> **Status:** PLANTED 2026-06-15 — design only.
+> **Front:** post-#496. Forward pass targets `ndarray` (SIMD/BLAS/CLAM provider). Oracle = `tesseract-rs` FFI fork.
+> **Canon anchors:** master §4; `ndarray` SIMD kernels; bit-reproducibility doctrine (DeepNSM "bit-reproducible", envelope version-stamp).
+> **Skip-by-rule:** no legacy matcher; no layout. Input is a line crop, output is text + per-step posteriors.
+
+---
+
+## 0. Intent
+
+Run the hydrated LSTM (D-OCR-11) forward on `ndarray` to produce per-timestep
+class posteriors, then decode them with a faithful `recodebeam` (dictionary-aware,
+CTC-style) to text — **byte-identical to Tesseract** on fixed line crops. This is
+the tier that makes "1:1 Tesseract" provable; everything else is plumbing.
+
+## 1. Forward pass (`tesseract-rs/src/lstm/`, on ndarray) — D-OCR-20
+
+Faithful transcode of `lstm/` numerics. Each maps to ndarray ops:
+
+| Tesseract unit | ndarray realization | Exactness note |
+|---|---|---|
+| `WeightMatrix::MatrixDotVector` (int8) | int8 GEMV via ndarray SIMD kernel | **accumulation order + rounding must match** (D-OCR-22) |
+| `FullyConnected` | matmul + bias + activation LUT | activation table must be the same fixed-point LUT |
+| `LSTM` cell (gates i/f/o/g, peephole) | elementwise on ndarray slices | sigmoid/tanh LUTs identical to C++ |
+| `Convolve` / `Maxpool` | im2col + GEMM / window-max | stride/pad identical |
+| `Softmax` / `LogSoftmax` | row softmax | only at the output; feeds the beam |
+
+Float path is straightforward. **The int8 path is where silent drift lives** — it
+is the whole point of D-OCR-22.
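A scalar reference for the int8 path makes the drift surface concrete. The sketch below is generic and is not a transcription of `WeightMatrix::MatrixDotVector`: the function name `gemv_i8`, the row-major layout, and the per-row `f32` scale are assumptions, and Tesseract's bias handling is omitted. In this form the `i32` sum is exact as long as it does not overflow, so the places a SIMD kernel can diverge from it are intermediate narrowing or saturation and the float scaling step.

```rust
/// Scalar reference: y[r] = scales[r] * sum_c(weights[r][c] * x[c]),
/// with the sum held in i32. Row-major weights; per-row scale is an assumption.
pub fn gemv_i8(weights: &[i8], rows: usize, cols: usize, x: &[i8], scales: &[f32]) -> Vec<f32> {
    assert_eq!(weights.len(), rows * cols);
    assert_eq!(x.len(), cols);
    assert_eq!(scales.len(), rows);
    (0..rows)
        .map(|r| {
            let row = &weights[r * cols..(r + 1) * cols];
            // Widen each product to i32 before accumulating, left to right.
            let mut acc: i32 = 0;
            for c in 0..cols {
                acc += row[c] as i32 * x[c] as i32;
            }
            // Single conversion and single multiply per row.
            acc as f32 * scales[r]
        })
        .collect()
}
```

If the scalar form were chosen as canonical (OD-20a), each SIMD variant would be tested for equality against this output rather than against another SIMD variant.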
+
+## 2. recodebeam decoder (`tesseract-rs/src/recodebeam.rs`) — D-OCR-21
+
+Hand-port (NOT codegen): tie-breaking, normalization, and dawg interaction are
+under-documented and behaviorally subtle.
+
+- Beam over the **recoder** codes (not raw unichars): the `RecodeBeamSearch`
+  maintains dawg-constrained and unconstrained beams; final path picks per
+  Tesseract's certainty/rating rule.
+- DAWG dictionary (`dict/{dawg,trie,permdawg}`) — **codegen-amenable** node-array
+  walks; the *interaction* with the beam is hand-ported.
+- Output: best text + per-token rating/certainty → becomes per-token confidence at
+  the emit stage (master §1).
+
+## 3. int8-SIMD numeric exactness conformance — D-OCR-22
+
+The conformance contract that earns "1:1":
+
+1. Pin the int8 GEMV accumulation order to Tesseract's (block/tile order matters).
+2. Match the fixed-point rounding mode of `IntSimdMatrix` (AVX2/512/NEON variants
+   reduce in a defined order — replicate it, do not "improve" it).
+3. Identical activation LUTs (sigmoid/tanh/softmax) — copy the tables, not the
+   formulae.
+4. Conformance harness: feed N line crops, compare per-timestep argmax AND the
+   full posterior (bit-exact, i.e. 0 ULP, on the int8 path) against an FFI dump from the oracle.
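Step 4 can be sketched as two checks per timestep: the argmax and the raw bit patterns of the posteriors. A minimal illustration, assuming posteriors arrive as `f32` slices; `argmax` and `first_bit_mismatch` are hypothetical helper names. Comparing with `to_bits` is stricter than `==`: it distinguishes `+0.0` from `-0.0` and treats identical NaN payloads as equal.

```rust
/// Index of the largest value; ties resolve to the lowest index.
pub fn argmax(row: &[f32]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (i, &v) in row.iter().enumerate() {
        if best.map_or(true, |b| v > row[b]) {
            best = Some(i);
        }
    }
    best
}

/// First index whose f32 bit pattern differs between oracle and port,
/// or `None` if the rows are bit-identical. A length mismatch is reported
/// at the shorter length.
pub fn first_bit_mismatch(oracle: &[f32], port: &[f32]) -> Option<usize> {
    if oracle.len() != port.len() {
        return Some(oracle.len().min(port.len()));
    }
    oracle
        .iter()
        .zip(port)
        .position(|(a, b)| a.to_bits() != b.to_bits())
}
```

A one-ULP deviation can leave the argmax unchanged, which is why the contract asks for both checks rather than argmax alone.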
+
+## 4. Ground-truth oracle
+
+The `AdaWorldAPI/tesseract-rs` FFI fork (thin bindings, `src/{lib,page_seg_mode}.rs`)
+is built **only** as the oracle: it runs real `libtesseract` to dump (a) per-matrix
+weights, (b) per-timestep posteriors, (c) final decoded text for the same crops.
+The Rust port is diffed against these. The oracle is a dev/test dependency, never a
+runtime path, and the lone place the Leptonica C fork is compiled.
+
+## 5. Deliverables
+
+- **D-OCR-20:** forward pass on ndarray reproduces C++ per-timestep posteriors
+  (float path 1:1; int8 path within the D-OCR-22 contract) on a 1k-crop set.
+- **D-OCR-21:** `recodebeam` + DAWG reproduces C++ decoded text byte-identical on
+  the same set.
+- **D-OCR-22:** int8 conformance harness green on ≥ 10k crops across 2+ languages.
+
+## 6. Open decisions
+
+- **OD-20a:** target one SIMD width first (AVX2) for exactness, then NEON/AVX-512;
+  or define the scalar reference as canonical and treat SIMD as "must equal scalar"?
+- **OD-21a:** support Tesseract's `lstm_choice_mode` (top-k per timestep) now — it
+  feeds the OCR `ValueTenant` top-k candidates (ocr-soa-integration OD-1) — or later?