Merged
25 commits
d911a1c
docs(plan): plant ocr-canonical-soa-integration-v1.md
AdaWorldAPI Jun 15, 2026
9725503
docs(plan): plant tesseract-rs-ast-dll-codegen-v1.md
AdaWorldAPI Jun 15, 2026
1e24600
docs(plan): plant tesseract-rs-lstm-recodebeam-v1.md
AdaWorldAPI Jun 15, 2026
09b0b4e
docs(plan): plant tesseract-rs-neural-layout-ocrs-v1.md
AdaWorldAPI Jun 15, 2026
1c9736f
docs(plan): plant tesseract-rs-traineddata-ndarray-v1.md
AdaWorldAPI Jun 15, 2026
6ddbd97
docs(plan): plant tesseract-rs-transcode-master-v1.md
AdaWorldAPI Jun 15, 2026
b021d2c
docs(plan): retire tesseract-rs-traineddata-ndarray-v1.md (v2 superse…
AdaWorldAPI Jun 15, 2026
5d6b51c
docs(plan): retire tesseract-rs-lstm-recodebeam-v1.md (v2 supersession)
AdaWorldAPI Jun 15, 2026
7f85b58
docs(plan): retire tesseract-rs-neural-layout-ocrs-v1.md (v2 superses…
AdaWorldAPI Jun 15, 2026
a324864
docs(plan): v2 — ocr-canonical-soa-integration-v1.md
AdaWorldAPI Jun 15, 2026
357634e
docs(plan): v2 — tesseract-rs-ast-dll-codegen-v1.md
AdaWorldAPI Jun 15, 2026
2dc636b
docs(plan): v2 — tesseract-rs-layout-transcode-v1.md
AdaWorldAPI Jun 15, 2026
534a9c5
docs(plan): v2 — tesseract-rs-recodebeam-transcode-v1.md
AdaWorldAPI Jun 15, 2026
dbd6386
docs(plan): v2 — tesseract-rs-traineddata-gguf-v1.md
AdaWorldAPI Jun 15, 2026
eeb0e7f
docs(plan): v2 — tesseract-rs-transcode-master-v1.md
AdaWorldAPI Jun 15, 2026
5b7ba97
docs(plan): residue-cascade text reconstruction (no hashes); gate D-O…
AdaWorldAPI Jun 15, 2026
aa0ef21
docs(plan): residue-cascade text reconstruction (no hashes); gate D-O…
AdaWorldAPI Jun 15, 2026
abdbf9d
docs(plan): residue-cascade text reconstruction (no hashes); gate D-O…
AdaWorldAPI Jun 15, 2026
c089518
docs(plan): helix residue = 48-bit (2x ResidueEdge), distinct from 48…
AdaWorldAPI Jun 15, 2026
39a23de
docs(plan): helix residue = phi-spiral endpoint-pair edge (3B/24bit),…
AdaWorldAPI Jun 15, 2026
4c46707
docs(plan): helix residue in BITS (24b/edge, 48b/token); disown bogus…
AdaWorldAPI Jun 15, 2026
df48c87
docs(plan): helix residue = 24-bit golden index (probe #495); helix-4…
AdaWorldAPI Jun 15, 2026
209ea6a
docs(plan): centroid attention field synthesis — helix residue as fie…
AdaWorldAPI Jun 15, 2026
6e316fa
docs(plan): centroid attention field synthesis — helix residue as fie…
AdaWorldAPI Jun 15, 2026
298e8e9
docs(plan): aerial = lance-graph-arm-discovery (Aerial+ codebook-dist…
AdaWorldAPI Jun 15, 2026
127 changes: 127 additions & 0 deletions .claude/plans/ocr-canonical-soa-integration-v1.md
@@ -0,0 +1,127 @@
# OCR → Canonical SoA Integration v1

> **Type:** plan (sub-plan — the one that binds OCR to the lance-graph substrate). Deliverables D-OCR-50/51/52/53.
> **Status:** PLANTED 2026-06-15 — design only. THIS is "use the new architecture we raced for."
> **Front:** post-#496. Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`).
> **Canon anchors:** OGAR/CLAUDE.md P0 GUID; lance-graph/CLAUDE.md SoA node (`4ea6ac9`); soa-three-tier-model; DeepNSM crate (`lance-graph/crates/deepnsm`); helix/CAM-PQ (`crates/helix`, `bgz-tensor` CAM-PQ).
> **Skip-by-rule:** OCR introduces NO bespoke row geometry. It rides the existing value-tenant carve.

---

## 0. Intent

An OCR token is not a foreign payload that needs a boundary adapter — it **is** a
canonical SoA node. This plan defines the mapping so recognized text lands directly
in the substrate: addressed by HHTL, classed by OGAR, valued by a `ValueSchema`
preset over *existing* `ValueTenant`s, edged by `EdgeCodecFlavor`, repaired by
DeepNSM + CAM/PQ, and persisted via `NodeRowPacket`. Zero boundary tax — the whole
point of the splat-native / "one representation, many views" doctrine, applied to OCR.

## 1. OCR token → `NodeRow` mapping (D-OCR-50)

**Key (`NodeGuid`, 16 B):** `classid · HEEL · HIP · TWIG · family · identity`.
- `classid` = the minted OCR class prefix (see §2). `0x0000_0000` fallback until OGAR mints it.
- HHTL path (HEEL/HIP/TWIG) = document → page → block (the layout hierarchy from `ocrs::layout_analysis`).
- `family` (3 B) = line/region basin; `identity` (3 B) = token ordinal within the basin.
→ `local_key()` (trailing 6 B) addresses a token within its line after the trie walk.
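
A minimal sketch of the key layout described above. Only the 16-byte total and the 3-byte `family` / 3-byte `identity` widths come from this plan; the 4-byte `classid`, the 2-byte HEEL/HIP/TWIG fields, the little-endian packing, and the `pack` constructor are assumptions made for illustration, not the definitions in `canonical_node.rs`.

```rust
/// Hypothetical 16-byte key. Assumed split: classid(4) heel(2) hip(2) twig(2)
/// family(3) identity(3). Only the last two widths are stated in the plan.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeGuid(pub [u8; 16]);

impl NodeGuid {
    /// Packs every field little-endian (assumed byte order).
    pub fn pack(classid: u32, heel: u16, hip: u16, twig: u16, family: u32, identity: u32) -> Self {
        assert!(family < (1 << 24) && identity < (1 << 24), "family/identity are 3-byte fields");
        let mut b = [0u8; 16];
        b[0..4].copy_from_slice(&classid.to_le_bytes());
        b[4..6].copy_from_slice(&heel.to_le_bytes());
        b[6..8].copy_from_slice(&hip.to_le_bytes());
        b[8..10].copy_from_slice(&twig.to_le_bytes());
        b[10..13].copy_from_slice(&family.to_le_bytes()[..3]);
        b[13..16].copy_from_slice(&identity.to_le_bytes()[..3]);
        NodeGuid(b)
    }

    /// Trailing 6 bytes (family + identity): addresses a token within its line.
    pub fn local_key(&self) -> [u8; 6] {
        let mut k = [0u8; 6];
        k.copy_from_slice(&self.0[10..16]);
        k
    }

    /// Leading class prefix; all zeros is the pre-mint fallback.
    pub fn classid(&self) -> u32 {
        u32::from_le_bytes([self.0[0], self.0[1], self.0[2], self.0[3]])
    }
}
```

With this layout, `NodeGuid::pack(0, doc, page, block, line, ordinal)` yields the fallback-class key for one token, and `local_key()` is what a lookup would compare once the HHTL trie walk has reached the block.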

**Edges (`EdgeBlock`, 16 B = 12 in-family + 4 out-of-family):**
- in-family (12): reading-order + local-layout adjacency (prev/next token, same-line
neighbors, baseline siblings). `EdgeCodecFlavor::CoarseOnly` (1 B/slot) — pure topology.
- out-of-family (4): inherited adapters — (A) table-cell membership, (B) block/column
parent, (C) semantic/coref link (post-DeepNSM), (D) source-region (bbox → page geometry).
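
The 12 + 4 slot split can be pictured as follows. This is a sketch under stated assumptions: the field names, the `Adapter` enum, and the convention that `0` marks an empty slot are hypothetical; only the slot counts, the one-byte-per-slot `CoarseOnly` width, and the four adapter roles (A) to (D) are taken from the list above.

```rust
/// Sketch of a 16-slot edge block at 1 byte per slot (`CoarseOnly`).
/// 0 = empty slot is an assumed convention.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct EdgeBlock {
    pub in_family: [u8; 12],
    pub out_of_family: [u8; 4],
}

/// Out-of-family adapter slots (A)..(D); names are hypothetical.
#[derive(Clone, Copy, Debug)]
pub enum Adapter {
    TableCell = 0,
    BlockParent = 1,
    Semantic = 2,
    SourceRegion = 3,
}

impl EdgeBlock {
    /// Writes `code` into the first empty in-family slot and returns its
    /// index, or None when all 12 slots are occupied.
    pub fn push_in_family(&mut self, code: u8) -> Option<usize> {
        let i = self.in_family.iter().position(|&s| s == 0)?;
        self.in_family[i] = code;
        Some(i)
    }

    /// Sets one of the four inherited adapter slots.
    pub fn set_adapter(&mut self, which: Adapter, code: u8) {
        self.out_of_family[which as usize] = code;
    }
}
```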

## 2. OCR class + HHTL address scheme (D-OCR-50)

- Mint an OCR class family in OGAR (`ogar-ontology`): `Document → Page → Block →
Line → Token`, with leaf token subtypes (`Word`, `Number`, `Date`, `Currency`,
`Glyph`, `TableCell`). Until OGAR mints them, hardcode the classid prefix space
per the reserve-don't-reclaim ladder (the classid bytes stay reserved at offset 0).
- `ClassView` for the OCR class declares `edge_codec_flavor` (`CoarseOnly`) and
`value_schema` (the OCR preset, §3).

## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51)

The 480-byte value slab already carves into `VALUE_TENANTS`. An OCR token is **not
a stored string and not a hash** — it is the *terminal of the perturbation cascade*,
reconstructed exactly like every other node. Text = codebook index + residue.

| Tenant (existing) | OCR role |
|---|---|
| helix residue = **centroid attention field** (NOT a stored code) | The 24-bit golden index is the **query↔centroid alignment** (φ-spiral direction = how this point attends to its place-centroid); the Morton-tile stacked-pyramid perturbation-shader is **multi-scale attention** (coarse centroid → fine perturbation = HHTL cascade in residue space). The field is **evaluated from the φ-template, never stored** ("8K resolution at Super-8 cost" — only the index is kept). Place=HHTL centroid; residue=perturbation off it. The 48-byte `ValueTenant::HelixResidue` is category-wrong (stores a field that must be computed) — do NOT use it. |
| `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook |
| `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV |
| `EntityType` (u16) | token subtype (Word/Number/Date/Currency/Glyph/TableCell) |
| `Plasticity` (u32) | correction history / last-repair stamp |

**Reconstruction (this is the round-trip, and it answers Codex P1):**
`text ⇄ codebook_index(Meta) + field-eval(helix 24-bit golden-index attention ⊕ TurbovecResidue PQ)`. Decode =
the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to
the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` /
coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The
reversibility lives in residue + codebook, which is the architecture's whole point.
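
The read-off step (residue in, codebook word out) can be illustrated with a deliberately naive stand-in. This is not CAKES and not the perturbation-shader cascade: it is a linear scan with Hamming distance over hypothetical 64-bit codes, shown only to make the contract concrete. The function name, the code width, and the metric are all assumptions.

```rust
/// Linear-scan stand-in for nearest-valid-token. The real search is
/// clustered (CAKES); this only shows "code in, nearest codebook word out".
/// Returns the word and its Hamming distance, or None for an empty codebook.
/// On ties the earliest codebook entry wins.
pub fn nearest_valid_token<'a>(
    query: u64,
    codebook: &'a [(u64, &'a str)],
) -> Option<(&'a str, u32)> {
    codebook
        .iter()
        .map(|&(code, word)| (word, (code ^ query).count_ones()))
        .min_by_key(|&(_, dist)| dist)
}
```

A distance threshold on the returned value would be one way to decide when a token has no acceptable neighbor and should take the true-OOV path instead.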

**True-OOV (no codebook neighbor — a raw code like `69B8`):** falls back to the
**recoder-code residue** — `recodebeam` already emits recoder codes, not pixels, so
the codes themselves are the reversible payload in `Meta`, repaired by the
char-confusion grammar (D-OCR-52). Still a residue, never a hash.

**ValueSchema:** `Cognitive` does NOT include `HelixResidue`/`TurbovecResidue`, so
OCR needs a dedicated **`ValueSchema::Ocr`** = `FieldMask` over
{`HelixResidue`,`TurbovecResidue`,`Meta`,`EntityType`,`Plasticity`}. Selection only;
moves no tenant (canon: tenants never move/reuse).
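
"Selection only" can be sketched as a bitmask over tenant ids. The tenant set mirrors the one listed in the paragraph above; the bit positions, the `u32` width, and the `of` / `contains` helpers are hypothetical and do not reflect the actual `class_view.rs` definitions.

```rust
/// Hypothetical tenant ids; the real positions are not shown in this plan.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum ValueTenant {
    Meta = 0,
    EntityType = 1,
    Plasticity = 2,
    HelixResidue = 3,
    TurbovecResidue = 4,
}

/// A schema is a selection over tenants. It carries no offsets, so choosing
/// a schema cannot move or reuse a tenant.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FieldMask(pub u32);

impl FieldMask {
    /// Builds a mask with one bit set per listed tenant.
    pub const fn of(tenants: &[ValueTenant]) -> Self {
        let mut bits = 0u32;
        let mut i = 0;
        while i < tenants.len() {
            bits |= 1u32 << (tenants[i] as u32);
            i += 1;
        }
        FieldMask(bits)
    }

    pub const fn contains(self, t: ValueTenant) -> bool {
        self.0 & (1u32 << (t as u32)) != 0
    }
}

/// The OCR preset as listed above (assumed name).
pub const OCR_SCHEMA: FieldMask = FieldMask::of(&[
    ValueTenant::HelixResidue,
    ValueTenant::TurbovecResidue,
    ValueTenant::Meta,
    ValueTenant::EntityType,
    ValueTenant::Plasticity,
]);
```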

## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52)

The recognizer emits candidates+confidence; repair is the brainstem we already have:
- **Character/orthographic layer (new, thin, below DeepNSM):** `0/O 1/I/l 5/S rn/m`
confusion table + number/date/currency/table-cell grammars. Repairs orthography on
OOV garbage (codes, IDs like `69B8`) BEFORE the word layer. (This is the only
genuinely greenfield code; the word-frequency half already exists as
`deepnsm/word_frequency`.)
- **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder`
→ `similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation.
- **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue`
(PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA (clustered-hierarchical outlier detection) flags anomalous
tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`.
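
The character/orthographic layer could start as small as the sketch below: expand a token by one confusion swap and accept the first candidate that a vocabulary knows. The pair table comes from the list above; the single-swap limit, the function names, and the use of a plain `HashSet` in place of the DeepNSM `vocabulary` are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Confusion pairs from the list above; each is tried in both directions.
const CONFUSIONS: &[(&str, &str)] =
    &[("0", "O"), ("1", "I"), ("1", "l"), ("5", "S"), ("rn", "m")];

/// All strings reachable from `token` by exactly one confusion swap.
pub fn one_swap_candidates(token: &str) -> Vec<String> {
    let mut out = Vec::new();
    for &(a, b) in CONFUSIONS {
        for (from, to) in [(a, b), (b, a)] {
            for (pos, _) in token.match_indices(from) {
                let mut s = String::with_capacity(token.len());
                s.push_str(&token[..pos]);
                s.push_str(to);
                s.push_str(&token[pos + from.len()..]);
                out.push(s);
            }
        }
    }
    out
}

/// Returns the token itself if valid, else the first valid one-swap
/// candidate, else None (the caller falls through to the next layer).
pub fn repair(token: &str, vocab: &HashSet<&str>) -> Option<String> {
    if vocab.contains(token) {
        return Some(token.to_string());
    }
    one_swap_candidates(token)
        .into_iter()
        .find(|c| vocab.contains(c.as_str()))
}
```

Here `rnodern` repairs to `modern` when that word is in the vocabulary, while a raw code such as `69B8` produces no candidates at all and returns `None`, which is the case the grammars and the recoder-code fallback are meant to cover.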

Repaired token writes back: corrected token → codebook index in `Meta` + subtype in
`EntityType` (no `Fingerprint` hash, per §3); repair provenance → `Meta`/`Plasticity`.

## 5. Persistence + planner (kv-lance / surreal)

- `NodeRowPacket` → `SoaEnvelope` → Lance (kv-lance backend, per `surrealdb` fork).
OCR nodes are ordinary rows; a Lance version is a coherent page/document snapshot.
- `surreal_container` as the **OCR-job control plane** (per its role: planner / AST
adapter / time-series / kanban): kanban of OCR jobs (queued→detect→recognize→
repair→persisted via the Rubicon transitions already in `soa_view.rs`), time-series
of throughput, AST API for the repair-grammar (compile-time vs JIT grammars).

## 6. Bit-reproducibility harness (D-OCR-53) — the migration payoff

The transcode oracle (D-OCR-2x) makes OCR a **deterministic regression source for
the whole SoA migration**: the same line crop → C++ Tesseract text AND Rust port
text AND the resulting `NodeRow` bytes. Because every stage is supposed to be
bit-reproducible (DeepNSM bit-reproducible, envelope version-stamped, CausalEdge64
locked), a golden-file diff over (crop → NodeRow) exercises exactly the muscles the
migration must harden: `ndarray::hpc` hydration, the envelope LE round-trip, and
SIMD numeric exactness. OCR is the best external oracle the substrate has.
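
The core of a golden-file diff is small. The helper below is a sketch, with a hypothetical name, of the comparison a (crop → NodeRow bytes) harness would run: it reports the first offset at which the produced bytes depart from the golden image, so a failure points at a byte rather than at a whole row.

```rust
/// Returns the first byte offset where `actual` departs from `golden`, or
/// None when the two images are identical. A length mismatch with an equal
/// common prefix is reported at the end of the shorter image.
pub fn first_diff(golden: &[u8], actual: &[u8]) -> Option<usize> {
    let common = golden.len().min(actual.len());
    (0..common)
        .find(|&i| golden[i] != actual[i])
        .or_else(|| {
            if golden.len() != actual.len() {
                Some(common)
            } else {
                None
            }
        })
}
```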

## 7. Deliverables

- **D-OCR-50:** OCR class + HHTL address scheme; `ClassView` impl for OCR class.
- **D-OCR-51:** `ValueSchema` OCR preset (FieldMask over existing tenants); a token
round-trips token→NodeRow→token with no geometry change.
- **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known
OCR-garbage fixture (`69B8`, `rn`→`m`) is repaired by plausibility.
- **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the
SoA migration suite. **Prereq: D-OCR-50 + D-OCR-51** (class/HHTL/ValueSchema must
define the row layout before bytes can be golden-diffed).

## 8. Open decisions

- **OD-1:** dedicated `ValueTenant::OcrEvidence` vs ride `Meta`+`HelixResidue` (POC rides).
- **OD-50a:** is a "Token" one node, or is a "Line" the node and tokens are value-slab
sub-records? (Node-per-token is simpler + edges are natural; node-per-line is denser.)
- **OD-52a:** character-confusion layer as a `deepnsm` submodule vs a sibling
`coca-codebook` crate. (Word-frequency half already lives in `deepnsm/word_frequency`.)
69 changes: 69 additions & 0 deletions .claude/plans/soa-centroid-attention-field-synthesis-v1.md
@@ -0,0 +1,69 @@
# SoA Centroid Attention Field — Unified Synthesis v1

> **Type:** plan (phase-2 marker / co-architecture). Unifies recognition + reasoning + grammar as reads of ONE field.
> **Status:** PLANTED 2026-06-15. Gated on `cycle-coherent-soa-snapshot-v1` (plastic field ⇒ COW writes).
> **Canon:** helix crate (golden-index residue, φ-template); deepnsm; causal-edge (pearl/nars); TEKAMOLO (#495).

---

## 0. The one idea

The **48-bit helix residue + Morton-tile stacked-pyramid perturbation-shader IS a
centroid attention field.** Place (HHTL) = centroid; residue (24-bit golden index)
= each point's perturbation off it = the **query↔key alignment**; the pyramid =
**multi-scale attention** (coarse centroid → fine). The field is *evaluated from the
φ-spiral template, never stored*. Everything below is a **read of this one field at
a different scale** — not separate engines bolted together.

## 1. The reads (each is the same field, different scale)

| Capability | Real crate / source | What it is, as a field read |
|---|---|---|
| **Perception (ONNX/LSTM)** | embedanything(candle)/GGUF host | emits a **query** into the field (golden index + posteriors); the ONLY learned-perceptual part, stays hosted |
| **Attention eval** | `helix` (golden index, curve-ruler, `DistanceLut`) | query↔centroid alignment; Morton pyramid = coarse→fine resolution |
| **Markov context building / bundling** | `deepnsm::markov_bundle`, `encoder` | temporal **superposition along the field** = the bundling read (context = bundled perturbations) |
| **Quorum + NARS reasoning** | `causal-edge::{pearl,nars,syllogism}` | centroid **coupling** = edge read; quorum = agreement of multiple field reads; NARS truth = coupling strength |
| **Grammar heuristics** | `deepnsm::{parser,pos,morphology,spo,syllogism}` | syntactic **field masks** = structured attention over the field |
| **Relative-pronoun / syntax order** | TEKAMOLO resolver (#495) | resolves adverbial/relative-pronoun binding = constrained attention path |
| **Rule learning (the real "aerial")** | `lance-graph-arm-discovery::aerial` — Aerial+ transcode (arXiv 2504.19354), **autoencoder replaced by integer codebook-distance oracle** (palette256, ρ=0.9973 vs cosine) | mines SPO association rules **float-free / bitwise-deterministic** → `arm_to_truth_u8` → `CausalEdge64` confidence_u8 + i4 mantissa. This IS "learning edges" — the field's codebook distance replaces the f32 autoencoder. |
| **Episodic / coref** | AriGraph (`EpisodicWitness64`) | temporal chain read = the field over witness-time |
| **Nearest-valid-token** | `crystal_neighborhood`, `cam64`, CAKES + `turbovec` | field-alignment argmax = read-off to codebook word |

## 2. Why this is one object, not a pipeline

VSA bind/bundle/similarity **are** the field operations: bind = perturbation off
centroid, bundle = the pyramid's coarse-level superposition, similarity = field
alignment (`DistanceLut`). So DeepNSM's markov_bundle is the *symbolic readout* of
the field; NARS/quorum is the *edge coupling*; grammar/TEKAMOLO are *attention
masks*. No separate learning machine is needed — the attention field already does
binding/bundling/attention in one structure (Frady/Kleyko 1707.01429: trained-RNN
⊁ VSA for symbol sequences). **`aerial` is the proof in-tree:** Aerial+'s f32
autoencoder is replaced by the integer codebook-distance oracle (the field) and
still mines rules — neurosymbolic learning with NO autoencoder, NO SGD, NO seed.
What's missing is only **plasticity** (centroid drift), not a learner.
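
For readers new to the three VSA operations named above, here is a textbook binary-spatter-code illustration on 64-bit words: bind as XOR, bundle as bitwise majority, similarity as matching-bit count. It is a generic sketch, not the helix or `DistanceLut` implementation, and the function names are chosen for this example only.

```rust
/// Bind: XOR. Self-inverse, so binding twice with the same key unbinds.
pub fn bind(a: u64, b: u64) -> u64 {
    a ^ b
}

/// Bundle: bitwise majority vote. A bit is set when strictly more than half
/// of the items have it set (ties resolve to 0).
pub fn bundle(items: &[u64]) -> u64 {
    let mut out = 0u64;
    for bit in 0..64u32 {
        let ones = items.iter().filter(|&&v| (v >> bit) & 1 == 1).count();
        if ones * 2 > items.len() {
            out |= 1u64 << bit;
        }
    }
    out
}

/// Similarity: number of matching bits, 0..=64.
pub fn similarity(a: u64, b: u64) -> u32 {
    64 - (a ^ b).count_ones()
}
```

All three are integer-only, which is the property the section leans on: the operations stay bit-reproducible without any floating-point state.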

## 3. Phase-2: make the field plastic (the "learning edges")

Not new tenants — **the field adapts**:
- centroid **drift** (place-centroids move toward corpus density);
- shader **perturbation-gain** adaptation (the pyramid's response sharpens);
- timed by `Plasticity` tenant; coupled by `CausalEdge64` strength (NARS mantissa moves).
Evaluated from the φ-template (not materialized). **Hard dep:** `cycle-coherent-soa-snapshot`
COW — plastic field mutates per cycle; without snapshot it thrashes Lance.

## 4. ONNX combination (operator's point)

The ONNX-shaped recognizer and the field **meet at the query boundary**: ONNX emits
posteriors → the field's golden-index query; field eval + grammar masks + NARS
coupling resolve to the token. So ONNX = the perceptual *encoder into* the field;
the field = everything symbolic/sequential/relational. One substrate, two scales.

## 5. Determinism split (non-negotiable)
- **Frozen mode** (centroids/gains fixed) → bit-reproducible → the Tesseract oracle + golden-file harness run here.
- **Plastic mode** (field adapts) → live use; NOT golden-diffable; gated by snapshot.
Two modes, explicitly separated, or the bit-repro guarantee is lost.

## 6. Open
- **OD-A:** RESOLVED — "aerial" = `lance-graph-arm-discovery::aerial` (Aerial+ rule-mining, codebook-distance oracle, not AriGraph).
- **OD-B:** centroid drift rule — Hebbian on `Plasticity`, or NARS-revision on `CausalEdge64`? (probe-gate, measure first.)
- **OD-C:** operator sign-off required for any new tenant (anti-invention guardrail) — phase-2 should need NONE (field is evaluated, not stored).
87 changes: 87 additions & 0 deletions .claude/plans/tesseract-rs-ast-dll-codegen-v1.md
@@ -0,0 +1,87 @@
# tesseract-rs — AST-DLL C++→Rust Codegen Harness v1

> **Type:** plan (sub-plan). Deliverables D-OCR-40/41/42. The transcode *mechanism*.
> **Status:** PLANTED 2026-06-15 v2 — layout IS in scope (1:1 raw-pointer), not skipped.
> **Front:** post-#496. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine.
> **Canon anchors:** master §4. Deterministic + diff-gated (bit-reproducibility doctrine).
> **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is transcribed faithfully as raw-pointer Rust (1:1), with safe-refactor deferred to a later oracle-gated pass.

---

## 0. Intent

Transcode the *mechanical* C++ leaf modules (container parse, unicharset, recoder,
dawg node-arrays, weight-matrix struct walks) into Rust by a **deterministic,
reviewable codegen harness** rather than by hand — so the faithful tier is
auditable and re-runnable. The harness pairs a **clang C++ AST frontend** with a
**Rust emission backend built on the `ruff` AST/codegen crates**.

## 1. Why ruff (honest scoping)

`ruff` is a *Python* toolchain — `ruff_python_parser` / `ruff_python_ast` parse
**Python**, not C++. So ruff is **not** the C++ frontend. Its value here is the
mature, battle-tested **Rust-side AST → source emission discipline**:
`ruff_python_codegen` (AST → formatted source), `ruff_formatter` (the formatting
IR), `ruff_source_file`, and the `ruff_python_dto_check` pattern (structural
invariant checks on a typed AST). We reuse those *patterns and crates* as the
emission/formatting backend for a `RustAst → rust source` pipeline. The C++ side is
clang.

```
C++ source ──(libclang)──► Clang AST ──► [AST DLL: stable IR dump] ──► RustAst builder
──► (ruff codegen/formatter discipline) ──► formatted .rs ──► diff-gate vs FFI oracle
```

## 2. The "AST DLL" — D-OCR-40

The C++ AST is extracted once into a **stable, serializable IR** (the "AST DLL"):
a libclang traversal that dumps the subset we transcode (struct/enum decls, plain
methods, table initializers, fixed-size array walks) as a typed IR — independent of
clang version drift, so the emission step is reproducible. Functions touching
pointers-into-mutable-graphs, virtual dispatch, or template metaprogramming are
**flagged NOT-CODEGENABLE** for the idiomatic emission path and routed per §5:
hand-port for numeric/behavioral code, faithful 1:1 raw-pointer transcription for
layout code (in scope as of v2, not skipped).
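
The flagging rule can be sketched as a pure function over per-function facts that the frontend would record in the IR. Everything here is hypothetical (the `FnFacts` record, its field names, the `Flag` enum); only the three disqualifying properties come from the paragraph above.

```rust
/// Facts the frontend would record per C++ function (hypothetical fields).
#[derive(Debug, Clone, Default)]
pub struct FnFacts {
    pub name: String,
    pub touches_mutable_graph_ptr: bool,
    pub is_virtual: bool,
    pub uses_template_metaprogramming: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Flag {
    Codegenable,
    /// Carries every reason found, so a reviewer sees why it was excluded.
    NotCodegenable(Vec<&'static str>),
}

/// Applies the three disqualifying properties and collects the reasons.
pub fn flag(f: &FnFacts) -> Flag {
    let mut reasons = Vec::new();
    if f.touches_mutable_graph_ptr {
        reasons.push("pointer into mutable graph");
    }
    if f.is_virtual {
        reasons.push("virtual dispatch");
    }
    if f.uses_template_metaprogramming {
        reasons.push("template metaprogramming");
    }
    if reasons.is_empty() {
        Flag::Codegenable
    } else {
        Flag::NotCodegenable(reasons)
    }
}
```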

## 3. Rust emission via ruff crates — D-OCR-41

A `RustAst` builder consumes the IR and emits idiomatic Rust:
- field-by-field struct/enum transcription (canon: byte layout preserved);
- table/array initializers → `const`/`static` Rust tables;
- the emission goes through ruff's formatter IR so output is deterministic and
diff-stable (re-running codegen produces byte-identical source);
- a `dto_check`-style pass asserts the **LE byte contract** is preserved per struct
(no silent re-ordering / re-widening — the same invariant the SoA envelope audit
enforces).
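
A structural check of this kind can be sketched in plain Rust. The struct below is a made-up example, not a Tesseract type, and the `Layout` tuple standing in for the C++ `sizeof`/offset dump is an assumed format; the point is only that a `#[repr(C)]` struct's size and field offsets can be measured and compared against a recorded dump.

```rust
/// Made-up mirror of a C++ POD struct, for illustration only.
#[repr(C)]
#[derive(Default)]
pub struct RecodedHeader {
    pub code_len: u16,
    pub flags: u16,
    pub offset: u32,
}

/// (sizeof, [(field name, byte offset)]): assumed shape of the C++ dump.
pub type Layout = (usize, Vec<(&'static str, usize)>);

/// Byte distance from the start of `base` to `field` (field must be in base).
fn offset_in<T, F>(base: &T, field: &F) -> usize {
    field as *const F as usize - base as *const T as usize
}

/// Measures the layout the Rust compiler actually produced.
pub fn rust_layout() -> Layout {
    let h = RecodedHeader::default();
    (
        std::mem::size_of::<RecodedHeader>(),
        vec![
            ("code_len", offset_in(&h, &h.code_len)),
            ("flags", offset_in(&h, &h.flags)),
            ("offset", offset_in(&h, &h.offset)),
        ],
    )
}

/// Fails on any re-ordering or re-widening relative to the C++ dump.
pub fn dto_check(cpp: &Layout) -> Result<(), String> {
    let rust = rust_layout();
    if rust == *cpp {
        Ok(())
    } else {
        Err(format!("layout drift: rust {:?} vs c++ {:?}", rust, cpp))
    }
}
```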

## 4. Diff-gate — D-OCR-42

Every codegen'd module is validated against the FFI oracle:
- behavioral: emitted Rust function vs `libtesseract` function on the same inputs
(e.g. unicharset id↔utf8, recoder encode/decode, dawg word-membership) → byte-equal;
- structural: `dto_check` confirms each emitted struct's byte image matches the C++
`sizeof`/offset dump.
Codegen output is committed (not generated at build) so reviewers see real Rust;
the harness is re-runnable to prove the commit equals the generator output.
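
The behavioral half of the gate amounts to running the oracle and the port over the same inputs and stopping at the first disagreement. The sketch below uses closures as stand-ins for the `libtesseract` FFI call and the emitted Rust function; `diff_gate` and its signature are hypothetical.

```rust
/// Runs `oracle` and `port` over the same inputs. Returns Ok(number of
/// inputs checked) when every output is equal, or Err(first input on which
/// the two disagree).
pub fn diff_gate<I: Clone, O: PartialEq>(
    inputs: &[I],
    oracle: impl Fn(&I) -> O,
    port: impl Fn(&I) -> O,
) -> Result<usize, I> {
    for input in inputs {
        if oracle(input) != port(input) {
            return Err(input.clone());
        }
    }
    Ok(inputs.len())
}
```

Returning the offending input rather than a boolean keeps a red gate actionable: the failing case can be replayed directly against both sides.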

## 5. Module assignment (codegen vs hand vs replace)

| C++ area | Route |
|---|---|
| `tessdatamanager`, `unicharset`, `unicharcompress` (recoder), `dawg`/`trie` node arrays, `weightmatrix` struct/quant walks | **CODEGEN (D-OCR-41)** |
| `recodebeam` (beam + dawg interaction), int8 GEMV rounding | **HAND-PORT** (numeric/behavioral subtlety) |
| `textord`/`ccstruct` layout | **CODEGEN → faithful raw-pointer Rust (D-OCR-30)** — intrusive ELIST/CLIST transcribed 1:1, NOT replaced |
| Leptonica (~dozen ops only) | hand-port to image/imageproc (D-OCR-31) |

## 6. Deliverables

- **D-OCR-40:** libclang → stable IR dump for the codegen-target module set; NOT-CODEGENABLE flagging works.
- **D-OCR-41:** IR → committed Rust via ruff emission; re-run is byte-identical.
- **D-OCR-42:** behavioral + structural diff-gate green for the target modules vs the FFI oracle.

## 7. Open decisions

- **OD-3 (from master):** libclang in-process vs clang `-ast-dump=json` consumed by
a Rust IR. JSON is simpler/decoupled; libclang is richer/faster. Default: clang
JSON dump for v1 (decoupled, reproducible), libclang later if needed.
- **OD-40a:** is the AST-DLL harness OCR-specific, or a reusable
`AdaWorldAPI/<cpp-transcode>` tool? (It would also serve other C++→Rust ports.)