Merged
Changes from 6 commits
25 commits
d911a1c
docs(plan): plant ocr-canonical-soa-integration-v1.md
AdaWorldAPI Jun 15, 2026
9725503
docs(plan): plant tesseract-rs-ast-dll-codegen-v1.md
AdaWorldAPI Jun 15, 2026
1e24600
docs(plan): plant tesseract-rs-lstm-recodebeam-v1.md
AdaWorldAPI Jun 15, 2026
09b0b4e
docs(plan): plant tesseract-rs-neural-layout-ocrs-v1.md
AdaWorldAPI Jun 15, 2026
1c9736f
docs(plan): plant tesseract-rs-traineddata-ndarray-v1.md
AdaWorldAPI Jun 15, 2026
6ddbd97
docs(plan): plant tesseract-rs-transcode-master-v1.md
AdaWorldAPI Jun 15, 2026
b021d2c
docs(plan): retire tesseract-rs-traineddata-ndarray-v1.md (v2 superse…
AdaWorldAPI Jun 15, 2026
5d6b51c
docs(plan): retire tesseract-rs-lstm-recodebeam-v1.md (v2 supersession)
AdaWorldAPI Jun 15, 2026
7f85b58
docs(plan): retire tesseract-rs-neural-layout-ocrs-v1.md (v2 superses…
AdaWorldAPI Jun 15, 2026
a324864
docs(plan): v2 — ocr-canonical-soa-integration-v1.md
AdaWorldAPI Jun 15, 2026
357634e
docs(plan): v2 — tesseract-rs-ast-dll-codegen-v1.md
AdaWorldAPI Jun 15, 2026
2dc636b
docs(plan): v2 — tesseract-rs-layout-transcode-v1.md
AdaWorldAPI Jun 15, 2026
534a9c5
docs(plan): v2 — tesseract-rs-recodebeam-transcode-v1.md
AdaWorldAPI Jun 15, 2026
dbd6386
docs(plan): v2 — tesseract-rs-traineddata-gguf-v1.md
AdaWorldAPI Jun 15, 2026
eeb0e7f
docs(plan): v2 — tesseract-rs-transcode-master-v1.md
AdaWorldAPI Jun 15, 2026
5b7ba97
docs(plan): residue-cascade text reconstruction (no hashes); gate D-O…
AdaWorldAPI Jun 15, 2026
aa0ef21
docs(plan): residue-cascade text reconstruction (no hashes); gate D-O…
AdaWorldAPI Jun 15, 2026
abdbf9d
docs(plan): residue-cascade text reconstruction (no hashes); gate D-O…
AdaWorldAPI Jun 15, 2026
c089518
docs(plan): helix residue = 48-bit (2x ResidueEdge), distinct from 48…
AdaWorldAPI Jun 15, 2026
39a23de
docs(plan): helix residue = phi-spiral endpoint-pair edge (3B/24bit),…
AdaWorldAPI Jun 15, 2026
4c46707
docs(plan): helix residue in BITS (24b/edge, 48b/token); disown bogus…
AdaWorldAPI Jun 15, 2026
df48c87
docs(plan): helix residue = 24-bit golden index (probe #495); helix-4…
AdaWorldAPI Jun 15, 2026
209ea6a
docs(plan): centroid attention field synthesis — helix residue as fie…
AdaWorldAPI Jun 15, 2026
6e316fa
docs(plan): centroid attention field synthesis — helix residue as fie…
AdaWorldAPI Jun 15, 2026
298e8e9
docs(plan): aerial = lance-graph-arm-discovery (Aerial+ codebook-dist…
AdaWorldAPI Jun 15, 2026
119 changes: 119 additions & 0 deletions .claude/plans/ocr-canonical-soa-integration-v1.md
@@ -0,0 +1,119 @@
# OCR → Canonical SoA Integration v1

> **Type:** plan (sub-plan — the one that binds OCR to the lance-graph substrate). Deliverables D-OCR-50/51/52/53.
> **Status:** PLANTED 2026-06-15 — design only. THIS is "use the new architecture we raced for."
> **Front:** post-#496. Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`).
> **Canon anchors:** OGAR/CLAUDE.md P0 GUID; lance-graph/CLAUDE.md SoA node (`4ea6ac9`); soa-three-tier-model; DeepNSM crate (`lance-graph/crates/deepnsm`); helix/CAM-PQ (`crates/helix`, `bgz-tensor` CAM-PQ).
> **Skip-by-rule:** OCR introduces NO bespoke row geometry. It rides the existing value-tenant carve.

---

## 0. Intent

An OCR token is not a foreign payload that needs a boundary adapter — it **is** a
canonical SoA node. This plan defines the mapping so recognized text lands directly
in the substrate: addressed by HHTL, classed by OGAR, valued by a `ValueSchema`
preset over *existing* `ValueTenant`s, edged by `EdgeCodecFlavor`, repaired by
DeepNSM + CAM/PQ, and persisted via `NodeRowPacket`. Zero boundary tax — the whole
point of the splat-native / "one representation, many views" doctrine, applied to OCR.

## 1. OCR token → `NodeRow` mapping (D-OCR-50)

**Key (`NodeGuid`, 16 B):** `classid · HEEL · HIP · TWIG · family · identity`.
- `classid` = the minted OCR class prefix (see §2). `0x0000_0000` fallback until OGAR mints it.
- HHTL path (HEEL/HIP/TWIG) = document → page → block (the layout hierarchy from `ocrs::layout_analysis`).
- `family` (3 B) = line/region basin; `identity` (3 B) = token ordinal within the basin.
→ `local_key()` (trailing 6 B) addresses a token within its line after the trie walk.
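
The key layout above can be sketched as a plain 16-byte array. This is an illustration only, not the `canonical_node.rs` definition: `GuidSketch` and `pack` are hypothetical names, the 2-byte width for each of HEEL/HIP/TWIG is an assumption (the plan fixes only the 16 B total, the 4-byte `0x0000_0000` classid fallback, and 3 B each for family and identity), and the little-endian field encoding is likewise assumed.

```rust
/// Hypothetical stand-in for `NodeGuid`. HEEL/HIP/TWIG widths are assumed (2 B each).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GuidSketch(pub [u8; 16]);

impl GuidSketch {
    /// Pack classid (4 B), HEEL/HIP/TWIG (2 B each, assumed), family and identity (3 B each).
    /// Returns `None` when `family` or `identity` does not fit in 24 bits.
    pub fn pack(
        classid: u32,
        heel: u16,
        hip: u16,
        twig: u16,
        family: u32,
        identity: u32,
    ) -> Option<Self> {
        if family > 0x00FF_FFFF || identity > 0x00FF_FFFF {
            return None;
        }
        let mut b = [0u8; 16];
        b[0..4].copy_from_slice(&classid.to_le_bytes());
        b[4..6].copy_from_slice(&heel.to_le_bytes());
        b[6..8].copy_from_slice(&hip.to_le_bytes());
        b[8..10].copy_from_slice(&twig.to_le_bytes());
        // Keep the low three bytes of each 24-bit field (little-endian, assumed).
        b[10..13].copy_from_slice(&family.to_le_bytes()[..3]);
        b[13..16].copy_from_slice(&identity.to_le_bytes()[..3]);
        Some(Self(b))
    }

    /// Trailing 6 bytes: family (3 B) followed by identity (3 B).
    pub fn local_key(&self) -> [u8; 6] {
        let mut k = [0u8; 6];
        k.copy_from_slice(&self.0[10..16]);
        k
    }
}
```

With this shape, two tokens on the same line share the first 13 bytes and differ only in the identity bytes of `local_key()`.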

**Edges (`EdgeBlock`, 16 B = 12 in-family + 4 out-of-family):**
- in-family (12): reading-order + local-layout adjacency (prev/next token, same-line
neighbors, baseline siblings). `EdgeCodecFlavor::CoarseOnly` (1 B/slot) — pure topology.
- out-of-family (4): inherited adapters — (A) table-cell membership, (B) block/column
parent, (C) semantic/coref link (post-DeepNSM), (D) source-region (bbox → page geometry).
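
A minimal sketch of the 12 + 4 slot split under a 1 B/slot codec follows. `EdgeBlockSketch`, `OutAdapter`, and the meaning of a slot byte (an opaque coarse neighbor code) are assumptions for illustration; only the slot counts and the four adapter roles (A) to (D) come from the plan.

```rust
/// The four out-of-family adapter roles (A) to (D) named in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum OutAdapter {
    TableCell = 0,
    BlockParent = 1,
    SemanticLink = 2,
    SourceRegion = 3,
}

/// Hypothetical stand-in for `EdgeBlock` with one byte per slot.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
#[repr(C)]
pub struct EdgeBlockSketch {
    /// Reading-order and local-layout neighbors (prev/next, same line, baseline siblings).
    pub in_family: [u8; 12],
    /// One slot per inherited adapter, indexed by `OutAdapter`.
    pub out_of_family: [u8; 4],
}

impl EdgeBlockSketch {
    /// Store a coarse target code in the slot that belongs to `role`.
    pub fn set_adapter(&mut self, role: OutAdapter, target: u8) {
        self.out_of_family[role as usize] = target;
    }

    /// Read back the coarse target code for `role`.
    pub fn adapter(&self, role: OutAdapter) -> u8 {
        self.out_of_family[role as usize]
    }
}
```

Fixing one slot per adapter role keeps the block at exactly 16 bytes regardless of how many adapters a token actually uses.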

## 2. OCR class + HHTL address scheme (D-OCR-50)

- Mint an OCR class family in OGAR (`ogar-ontology`): `Document → Page → Block →
Line → Token`, with leaf token subtypes (`Word`, `Number`, `Date`, `Currency`,
`Glyph`, `TableCell`). Until OGAR mints them, hardcode the classid prefix space
per the reserve-don't-reclaim ladder (the classid bytes stay reserved at offset 0).
- `ClassView` for the OCR class declares `edge_codec_flavor` (`CoarseOnly`) and
`value_schema` (the OCR preset, §3).

## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51)

The 480-byte value slab already carves into `VALUE_TENANTS`. OCR **rides existing
tenants** — no new tenant for the POC:

| Tenant (existing) | OCR use |
|---|---|
| `Fingerprint` (32 B / 256-bit) | glyph/line identity print (DeepNSM `encoder` XOR-bind/bundle of the crop) |
| `TurbovecResidue` (16 B, PQ) | glyph embedding → CAKES nearest-valid-token search |
| `HelixResidue` (48 B) | orthogonal residue: per-token deviation from class centroid (confidence-as-residue) |
| `Meta` (u64) | packed confidence + NSM-repair flags + token-subtype bits |
| `EntityType` (u16) | OCR token class discriminator (Word/Number/Date/Currency/Glyph/TableCell) |
| `Plasticity` (u32) | correction history / last-repair stamp |

→ define `ValueSchema::Ocr` (or select `Cognitive` if its mask already covers the
above) as a `FieldMask` over those `ValueTenant` positions. Selection only — it
carves *within* the slab, moves nothing (canon: tenants never move/reuse).
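
The "selection only" idea can be sketched as a bitmask over tenant positions. The byte widths are the ones listed in the table above; the bit positions, `Tenant`, and `MaskSketch` are hypothetical and do not reflect the real `FieldMask` or `VALUE_TENANTS` order.

```rust
/// Tenants the OCR preset rides, in an assumed (illustrative) bit order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Tenant {
    Fingerprint = 0,
    TurbovecResidue = 1,
    HelixResidue = 2,
    Meta = 3,
    EntityType = 4,
    Plasticity = 5,
}

impl Tenant {
    pub const ALL: [Tenant; 6] = [
        Tenant::Fingerprint,
        Tenant::TurbovecResidue,
        Tenant::HelixResidue,
        Tenant::Meta,
        Tenant::EntityType,
        Tenant::Plasticity,
    ];

    /// Byte width of each tenant as given in the table.
    pub const fn width(self) -> usize {
        match self {
            Tenant::Fingerprint => 32,
            Tenant::TurbovecResidue => 16,
            Tenant::HelixResidue => 48,
            Tenant::Meta => 8,
            Tenant::EntityType => 2,
            Tenant::Plasticity => 4,
        }
    }
}

/// Hypothetical stand-in for `FieldMask`: one bit per selected tenant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MaskSketch(pub u32);

impl MaskSketch {
    pub fn of(tenants: &[Tenant]) -> Self {
        Self(tenants.iter().fold(0, |m, t| m | (1u32 << (*t as u8))))
    }

    pub fn contains(self, t: Tenant) -> bool {
        self.0 & (1u32 << (t as u8)) != 0
    }

    /// Total bytes the selected tenants occupy inside the slab.
    pub fn selected_bytes(self) -> usize {
        Tenant::ALL
            .iter()
            .filter(|t| self.contains(**t))
            .map(|t| t.width())
            .sum()
    }
}
```

Selecting all six tenants covers 110 bytes, well inside the 480-byte slab, and the mask itself never relocates a tenant.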

**OD-1 (deferred):** a dedicated `ValueTenant::OcrEvidence` (bbox `[f16;4]` +
per-char confidence + top-k recodebeam candidates) is the clean home for
recognizer evidence. Adding a tenant is canon-significant, so the POC packs a
compressed form into `Meta`+`HelixResidue` and defers the dedicated tenant to a
follow-up once the evidence shape is stable (needs D-OCR-21 `lstm_choice_mode`).
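
One way the POC could pack evidence into `Meta` is sketched below. The bit layout (16-bit confidence, 8-bit repair flags, 8-bit subtype) is entirely an assumption; the plan only says that `Meta` carries packed confidence, NSM-repair flags, and token-subtype bits. `pack_meta` and `unpack_meta` are hypothetical names.

```rust
/// Hypothetical `Meta` layout (not specified by the plan):
/// bits 0..16  confidence as u16 fixed point,
/// bits 16..24 repair flags,
/// bits 24..32 token subtype,
/// bits 32..64 left zero in this sketch.
pub fn pack_meta(confidence: f32, repair_flags: u8, subtype: u8) -> u64 {
    // Clamp to [0, 1] and quantize to 16 bits.
    let q = (confidence.clamp(0.0, 1.0) * 65535.0).round() as u64;
    q | ((repair_flags as u64) << 16) | ((subtype as u64) << 24)
}

/// Inverse of `pack_meta`; confidence comes back quantized to 1/65535 steps.
pub fn unpack_meta(meta: u64) -> (f32, u8, u8) {
    let confidence = (meta & 0xFFFF) as f32 / 65535.0;
    let repair_flags = ((meta >> 16) & 0xFF) as u8;
    let subtype = ((meta >> 24) & 0xFF) as u8;
    (confidence, repair_flags, subtype)
}
```

Per-character confidences and top-k candidates do not fit this layout, which is the gap OD-1 leaves for a dedicated tenant.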

## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52)

The recognizer emits candidates+confidence; repair is the brainstem we already have:
- **Character/orthographic layer (new, thin, below DeepNSM):** `0/O 1/I/l 5/S rn/m`
confusion table + number/date/currency/table-cell grammars. Repairs orthography on
OOV garbage (codes, IDs like `69B8`) BEFORE the word layer. (This is the only
genuinely greenfield code; the word-frequency half already exists as
`deepnsm/word_frequency`.)
- **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder`
→ `similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation.
- **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue`
(PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA flags anomalous
tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`.
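
The character/orthographic layer can be sketched as candidate expansion over the single-character confusion pairs, filtered by a grammar. This is a simplified illustration: it handles only the one-to-one pairs from the table (`0/O`, `1/I/l`, `5/S`), omits the multi-character `rn/m` case, enumerates candidates exhaustively (exponential in token length), and uses a digits-only check as a stand-in for the number grammar. All function names are hypothetical.

```rust
/// Single-character confusion pairs from the table above (0/O, 1/I/l, 5/S).
const CONFUSIONS: &[(char, char)] = &[('0', 'O'), ('1', 'I'), ('1', 'l'), ('I', 'l'), ('5', 'S')];

/// All alternatives for one character, with the original character first.
fn alternatives(c: char) -> Vec<char> {
    let mut out = vec![c];
    for &(a, b) in CONFUSIONS {
        if c == a {
            out.push(b);
        }
        if c == b {
            out.push(a);
        }
    }
    out
}

/// Enumerate every spelling reachable by per-character substitution.
/// The unmodified token is always the first candidate.
pub fn candidates(token: &str) -> Vec<String> {
    let mut acc = vec![String::new()];
    for c in token.chars() {
        let alts = alternatives(c);
        let mut next = Vec::with_capacity(acc.len() * alts.len());
        for prefix in &acc {
            for &a in &alts {
                let mut s = prefix.clone();
                s.push(a);
                next.push(s);
            }
        }
        acc = next;
    }
    acc
}

/// Return the first candidate accepted by `is_valid`, preferring the original spelling.
pub fn repair(token: &str, is_valid: impl Fn(&str) -> bool) -> Option<String> {
    candidates(token).into_iter().find(|c| is_valid(c))
}

/// Stand-in for the number grammar: one or more ASCII digits.
pub fn is_number(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}
```

Under this sketch `1O5` repairs to `105`, while `69B8` stays unrepaired by the number grammar because `B` has no entry in the listed pairs; such a token would be left for an ID/code grammar or passed through unchanged.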

Repaired token writes back: corrected text → `Fingerprint`/`EntityType`, repair
provenance → `Meta`/`Plasticity`.

## 5. Persistence + planner (kv-lance / surreal)

- `NodeRowPacket` → `SoaEnvelope` → Lance (kv-lance backend, per `surrealdb` fork).
OCR nodes are ordinary rows; a Lance version is a coherent page/document snapshot.
- `surreal_container` as the **OCR-job control plane** (per its role: planner / AST
adapter / time-series / kanban): kanban of OCR jobs (queued→detect→recognize→
repair→persisted via the Rubicon transitions already in `soa_view.rs`), time-series
of throughput, AST API for the repair-grammar (compile-time vs JIT grammars).
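
The job lifecycle named above (queued→detect→recognize→repair→persisted) reads as a linear state machine. The sketch below is hypothetical and does not reproduce the Rubicon transitions in `soa_view.rs`; it only shows the ordering the kanban would enforce.

```rust
/// Hypothetical OCR job states, in the order given by the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OcrJobState {
    Queued,
    Detect,
    Recognize,
    Repair,
    Persisted,
}

impl OcrJobState {
    /// The only legal forward transition from each state; `Persisted` is terminal.
    pub fn next(self) -> Option<Self> {
        match self {
            OcrJobState::Queued => Some(OcrJobState::Detect),
            OcrJobState::Detect => Some(OcrJobState::Recognize),
            OcrJobState::Recognize => Some(OcrJobState::Repair),
            OcrJobState::Repair => Some(OcrJobState::Persisted),
            OcrJobState::Persisted => None,
        }
    }

    /// True when moving from `self` to `to` is a single legal step.
    pub fn can_move_to(self, to: OcrJobState) -> bool {
        self.next() == Some(to)
    }
}
```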

## 6. Bit-reproducibility harness (D-OCR-53) — the migration payoff

The transcode oracle (D-OCR-2x) makes OCR a **deterministic regression source for
the whole SoA migration**: the same line crop → C++ Tesseract text AND Rust port
text AND the resulting `NodeRow` bytes. Because every stage is supposed to be
bit-reproducible (DeepNSM bit-reproducible, envelope version-stamped, CausalEdge64
locked), a golden-file diff over (crop → NodeRow) exercises exactly the muscles the
migration must harden: `ndarray::hpc` hydration, the envelope LE round-trip, and
SIMD numeric exactness. OCR is the best external oracle the substrate has.
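
A golden-file check of this kind reduces to a byte comparison that reports where the first divergence is. The sketch below assumes the golden image and the produced `NodeRow` bytes are already in memory; `GoldenDiff` and `golden_diff` are hypothetical names.

```rust
/// Outcome of comparing produced bytes against a golden image.
#[derive(Debug, PartialEq, Eq)]
pub enum GoldenDiff {
    Match,
    /// First differing byte within the common prefix.
    ByteMismatch { offset: usize, expected: u8, actual: u8 },
    /// Common prefix is identical but the lengths differ.
    LengthMismatch { expected: usize, actual: usize },
}

/// Compare `actual` against `expected` and report the first divergence, if any.
pub fn golden_diff(expected: &[u8], actual: &[u8]) -> GoldenDiff {
    if let Some(offset) = expected.iter().zip(actual).position(|(e, a)| e != a) {
        return GoldenDiff::ByteMismatch {
            offset,
            expected: expected[offset],
            actual: actual[offset],
        };
    }
    if expected.len() != actual.len() {
        return GoldenDiff::LengthMismatch {
            expected: expected.len(),
            actual: actual.len(),
        };
    }
    GoldenDiff::Match
}
```

Reporting the offset rather than a bare pass/fail lets a failure be traced to a specific field of the row (key, edges, or a tenant in the value slab).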

## 7. Deliverables

- **D-OCR-50:** OCR class + HHTL address scheme; `ClassView` impl for OCR class.
- **D-OCR-51:** `ValueSchema` OCR preset (FieldMask over existing tenants); a token
round-trips token→NodeRow→token with no geometry change.
- **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known
OCR-garbage fixture (`69B8`, `rn`→`m`) is repaired by plausibility.
- **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the
SoA migration suite.

## 8. Open decisions

- **OD-1:** dedicated `ValueTenant::OcrEvidence` vs ride `Meta`+`HelixResidue` (POC rides).
- **OD-50a:** is a "Token" one node, or is a "Line" the node and tokens are value-slab
sub-records? (Node-per-token is simpler + edges are natural; node-per-line is denser.)
- **OD-52a:** character-confusion layer as a `deepnsm` submodule vs a sibling
`coca-codebook` crate. (Word-frequency half already lives in `deepnsm/word_frequency`.)
86 changes: 86 additions & 0 deletions .claude/plans/tesseract-rs-ast-dll-codegen-v1.md
@@ -0,0 +1,86 @@
# tesseract-rs — AST-DLL C++→Rust Codegen Harness v1

> **Type:** plan (sub-plan). Deliverables D-OCR-40/41/42. The transcode *mechanism*.
> **Status:** PLANTED 2026-06-15 — design only.
> **Front:** post-#496. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine.
> **Canon anchors:** master §4. Deterministic + diff-gated (bit-reproducibility doctrine).
> **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is hand-ported or replaced.

---

## 0. Intent

Transcode the *mechanical* C++ leaf modules (container parse, unicharset, recoder,
dawg node-arrays, weight-matrix struct walks) into Rust by a **deterministic,
reviewable codegen harness** rather than by hand — so the faithful tier is
auditable and re-runnable. The harness pairs a **clang C++ AST frontend** with a
**Rust emission backend built on the `ruff` AST/codegen crates**.

## 1. Why ruff (honest scoping)

`ruff` is a *Python* toolchain — `ruff_python_parser` / `ruff_python_ast` parse
**Python**, not C++. So ruff is **not** the C++ frontend. Its value here is the
mature, battle-tested **Rust-side AST → source emission discipline**:
`ruff_python_codegen` (AST → formatted source), `ruff_formatter` (the formatting
IR), `ruff_source_file`, and the `ruff_python_dto_check` pattern (structural
invariant checks on a typed AST). We reuse those *patterns and crates* as the
emission/formatting backend for a `RustAst → rust source` pipeline. The C++ side is
clang.

```
C++ source ──(libclang)──► Clang AST ──► [AST DLL: stable IR dump] ──► RustAst builder
──► (ruff codegen/formatter discipline) ──► formatted .rs ──► diff-gate vs FFI oracle
```

## 2. The "AST DLL" — D-OCR-40

The C++ AST is extracted once into a **stable, serializable IR** (the "AST DLL"):
a libclang traversal that dumps the subset we transcode (struct/enum decls, plain
methods, table initializers, fixed-size array walks) as a typed IR — independent of
clang version drift, so the emission step is reproducible. Functions touching
pointers-into-mutable-graphs, virtual dispatch, or template metaprogramming are
**flagged NOT-CODEGENABLE** and routed to hand-port/replace (most of them are layout
code, which is already skipped per master §3).
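
The flagging rule can be sketched as a pure function over per-function facts recorded in the IR. `FnFacts`, `Route`, and `route` are hypothetical names; the three disqualifying features are the ones listed above.

```rust
/// Where a C++ function goes after IR extraction.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Codegen,
    NotCodegenable,
}

/// Hypothetical per-function facts recorded in the IR dump.
#[derive(Debug, Clone, Default)]
pub struct FnFacts {
    pub name: String,
    pub touches_mutable_graph_ptrs: bool,
    pub uses_virtual_dispatch: bool,
    pub uses_template_metaprogramming: bool,
}

/// A function is a codegen target only if none of the disqualifying features is present.
pub fn route(f: &FnFacts) -> Route {
    if f.touches_mutable_graph_ptrs || f.uses_virtual_dispatch || f.uses_template_metaprogramming {
        Route::NotCodegenable
    } else {
        Route::Codegen
    }
}
```

Keeping the decision a function of recorded facts means the flag is reproducible from the IR alone, without re-running clang.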

## 3. Rust emission via ruff crates — D-OCR-41

A `RustAst` builder consumes the IR and emits idiomatic Rust:
- field-by-field struct/enum transcription (canon: byte layout preserved);
- table/array initializers → `const`/`static` Rust tables;
- the emission goes through ruff's formatter IR so output is deterministic and
  diff-stable (re-running codegen produces byte-identical source);
- a `dto_check`-style pass asserts the **LE byte contract** is preserved per struct
(no silent re-ordering / re-widening — the same invariant the SoA envelope audit
enforces).
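
The determinism requirement can be illustrated with a toy emitter. This sketch uses plain string building as a stand-in for the ruff formatter IR (it does not call any ruff crate), and `FieldIr`, `StructIr`, and `emit_struct` are hypothetical names. The point is only that output depends on nothing but the IR, in declaration order.

```rust
/// One field of a transcribed struct, in C++ declaration order.
pub struct FieldIr {
    pub name: &'static str,
    pub rust_ty: &'static str,
}

/// Hypothetical IR for a plain struct.
pub struct StructIr {
    pub name: &'static str,
    pub fields: Vec<FieldIr>,
}

/// Emit a `#[repr(C)]` struct, field by field, with fixed indentation.
/// No timestamps, hash-map iteration, or environment lookups are involved,
/// so the same IR always yields the same bytes.
pub fn emit_struct(ir: &StructIr) -> String {
    let mut out = String::new();
    out.push_str("#[repr(C)]\n");
    out.push_str(&format!("pub struct {} {{\n", ir.name));
    for f in &ir.fields {
        out.push_str(&format!("    pub {}: {},\n", f.name, f.rust_ty));
    }
    out.push_str("}\n");
    out
}
```

Emitting `#[repr(C)]` with fields in source order is what keeps the byte layout reviewable against the C++ declaration.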

## 4. Diff-gate — D-OCR-42

Every codegen'd module is validated against the FFI oracle:
- behavioral: emitted Rust function vs `libtesseract` function on the same inputs
(e.g. unicharset id↔utf8, recoder encode/decode, dawg word-membership) → byte-equal;
- structural: `dto_check` confirms each emitted struct's byte image matches the C++
`sizeof`/offset dump.

Codegen output is committed (not generated at build) so reviewers see real Rust;
the harness is re-runnable to prove the commit equals the generator output.
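
The structural half of the gate can be sketched with the standard `size_of` and `offset_of!` facilities (the latter needs a reasonably recent stable Rust). `DemoRecord` is an invented struct, not a Tesseract type, and `LayoutDump` is a hypothetical shape for the C++ `sizeof`/offset dump.

```rust
use std::mem::{offset_of, size_of};

/// Hypothetical emitted struct; the fields are illustrative only.
#[repr(C)]
pub struct DemoRecord {
    pub id: u32,
    pub next: u32,
    pub flags: u8,
}

/// Layout facts as a C++ `sizeof`/`offsetof` dump might report them (assumed shape).
pub struct LayoutDump<'a> {
    pub size: usize,
    pub offsets: &'a [(&'a str, usize)],
}

/// Size and field offsets of the Rust struct.
pub fn rust_layout() -> (usize, Vec<(&'static str, usize)>) {
    (
        size_of::<DemoRecord>(),
        vec![
            ("id", offset_of!(DemoRecord, id)),
            ("next", offset_of!(DemoRecord, next)),
            ("flags", offset_of!(DemoRecord, flags)),
        ],
    )
}

/// True when the size and every named offset agree with the dump.
pub fn layout_matches(dump: &LayoutDump<'_>) -> bool {
    let (size, offsets) = rust_layout();
    size == dump.size
        && offsets.len() == dump.offsets.len()
        && offsets
            .iter()
            .zip(dump.offsets)
            .all(|(a, b)| a.0 == b.0 && a.1 == b.1)
}
```

Including the total size in the comparison catches trailing-padding differences that per-field offsets alone would miss.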

## 5. Module assignment (codegen vs hand vs replace)

| C++ area | Route |
|---|---|
| `tessdatamanager`, `unicharset`, `unicharcompress` (recoder), `dawg`/`trie` node arrays, `weightmatrix` struct/quant walks | **CODEGEN (D-OCR-41)** |
| `recodebeam` (beam + dawg interaction), int8 GEMV rounding | **HAND-PORT** (numeric/behavioral subtlety) |
| `textord`/`ccstruct` layout, Leptonica | **REPLACE** (ocrs / minimal imageproc) — never enters the harness |

## 6. Deliverables

- **D-OCR-40:** libclang → stable IR dump for the codegen-target module set; NOT-CODEGENABLE flagging works.
- **D-OCR-41:** IR → committed Rust via ruff emission; re-run is byte-identical.
- **D-OCR-42:** behavioral + structural diff-gate green for the target modules vs the FFI oracle.

## 7. Open decisions

- **OD-3 (from master):** libclang in-process vs clang `-Xclang -ast-dump=json` consumed by
a Rust IR. JSON is simpler/decoupled; libclang is richer/faster. Default: clang
JSON dump for v1 (decoupled, reproducible), libclang later if needed.
- **OD-40a:** is the AST-DLL harness OCR-specific, or a reusable
`AdaWorldAPI/<cpp-transcode>` tool? (It would also serve other C++→Rust ports.)
79 changes: 79 additions & 0 deletions .claude/plans/tesseract-rs-lstm-recodebeam-v1.md
@@ -0,0 +1,79 @@
# tesseract-rs — LSTM Forward + recodebeam Decoder v1

> **Type:** plan (sub-plan). Deliverables D-OCR-20/21/22.
> **Status:** PLANTED 2026-06-15 — design only.
> **Front:** post-#496. Forward pass targets `ndarray` (SIMD/BLAS/CLAM provider). Oracle = `tesseract-rs` FFI fork.
> **Canon anchors:** master §4; `ndarray` SIMD kernels; bit-reproducibility doctrine (DeepNSM "bit-reproducible", envelope version-stamp).
> **Skip-by-rule:** no legacy matcher; no layout. Input is a line crop, output is text + per-step posteriors.

---

## 0. Intent

Run the hydrated LSTM (D-OCR-11) forward on `ndarray` to produce per-timestep
class posteriors, then decode them with a faithful `recodebeam` (dictionary-aware,
CTC-style) to text — **byte-identical to Tesseract** on fixed line crops. This is
the tier that makes "1:1 Tesseract" provable; everything else is plumbing.

## 1. Forward pass (`tesseract-rs/src/lstm/`, on ndarray) — D-OCR-20

Faithful transcode of `lstm/` numerics. Each maps to ndarray ops:

| Tesseract unit | ndarray realization | Exactness note |
|---|---|---|
| `WeightMatrix::MatrixDotVector` (int8) | int8 GEMV via ndarray SIMD kernel | **accumulation order + rounding must match** (D-OCR-22) |
| `FullyConnected` | matmul + bias + activation LUT | activation table must be the same fixed-point LUT |
| `LSTM` cell (gates i/f/o/g, peephole) | elementwise on ndarray slices | sigmoid/tanh LUTs identical to C++ |
| `Convolve` / `Maxpool` | im2col + GEMM / window-max | stride/pad identical |
| `Softmax` / `LogSoftmax` | row softmax | only at the output; feeds the beam |

Float path is straightforward. **The int8 path is where silent drift lives** — it
is the whole point of D-OCR-22.
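
As a point of reference for the int8 path, a scalar matrix-vector product with a fixed left-to-right accumulation order might look like the sketch below. It is not a transcription of `WeightMatrix::MatrixDotVector`: the row-major layout, the per-row `f64` scale, and the absence of a bias column are assumptions, and `gemv_i8_scalar` is a hypothetical name.

```rust
/// Scalar int8 GEMV: for each row, accumulate products in `i32` from left to
/// right, then apply that row's scale once at the end.
/// `w` is row-major with `cols` entries per row; `scales` has one entry per row.
pub fn gemv_i8_scalar(w: &[i8], cols: usize, u: &[i8], scales: &[f64]) -> Vec<f64> {
    assert!(cols > 0, "cols must be non-zero");
    assert_eq!(u.len(), cols, "input vector length must equal cols");
    assert_eq!(w.len(), cols * scales.len(), "one scale per row is required");
    w.chunks_exact(cols)
        .zip(scales)
        .map(|(row, &scale)| {
            let mut total: i32 = 0;
            for (&wi, &ui) in row.iter().zip(u) {
                total += i32::from(wi) * i32::from(ui);
            }
            // The only floating-point operation: one multiply per output row.
            f64::from(total) * scale
        })
        .collect()
}
```

Pure `i32` accumulation is order-independent as long as it does not overflow; drift can enter where a SIMD variant uses narrower or saturating intermediates, or where scaling is applied in floating point at a different stage, which is why a scalar reference of this kind is useful to diff against.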

## 2. recodebeam decoder (`tesseract-rs/src/recodebeam.rs`) — D-OCR-21

Hand-port (NOT codegen): tie-breaking, normalization, and dawg interaction are
under-documented and behaviorally subtle.

- Beam over the **recoder** codes (not raw unichars): the `RecodeBeamSearch`
maintains dawg-constrained and unconstrained beams; final path picks per
Tesseract's certainty/rating rule.
- DAWG dictionary (`dict/{dawg,trie,permdawg}`) — **codegen-amenable** node-array
walks; the *interaction* with the beam is hand-ported.
- Output: best text + per-token rating/certainty → becomes per-token confidence at
the emit stage (master §1).
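
For orientation only, the CTC-style collapse that a beam generalizes can be written as a greedy decoder: argmax per timestep, merge repeats, drop blanks. This is deliberately a simpler, different technique than `RecodeBeamSearch` (no beam, no dawg constraint, no recoder codes) and would not reproduce Tesseract's output; `ctc_greedy` and the `blank` index parameter are assumptions.

```rust
/// Greedy CTC-style collapse over per-timestep posteriors.
/// Takes the argmax class at each step, merges consecutive repeats,
/// and drops the `blank` class. Ties resolve to the lowest class index.
pub fn ctc_greedy(posteriors: &[Vec<f32>], blank: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut prev: Option<usize> = None;
    for step in posteriors {
        // Argmax with strict `>` so the first maximum wins.
        let mut best: Option<(usize, f32)> = None;
        for (i, &p) in step.iter().enumerate() {
            if best.map_or(true, |(_, bp)| p > bp) {
                best = Some((i, p));
            }
        }
        let label = best.map(|(i, _)| i);
        if let Some(c) = label {
            // Emit only on a change of label, and never emit the blank.
            if c != blank && label != prev {
                out.push(c);
            }
        }
        prev = label;
    }
    out
}
```

A blank between two identical labels is what allows a doubled character to survive the merge, which the beam must also respect.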

## 3. int8-SIMD numeric exactness conformance — D-OCR-22

The conformance contract that earns "1:1":

1. Pin the int8 GEMV accumulation order to Tesseract's (block/tile order matters).
2. Match the fixed-point rounding mode of `IntSimdMatrix` (AVX2/512/NEON variants
reduce in a defined order — replicate it, do not "improve" it).
3. Identical activation LUTs (sigmoid/tanh/softmax) — copy the tables, not the
formulae.
4. Conformance harness: feed N line crops, compare per-timestep argmax AND the
   full posterior (bit-exact on the int8 path, i.e. 0 ULP) against an FFI dump
   from the oracle.
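
The comparison in step 4 can be sketched as follows, assuming both the oracle dump and the port produce posteriors as `f32` rows per timestep (the actual dump format is not specified here). `Drift` and `compare_posteriors` are hypothetical names; bit-exactness is checked with `f32::to_bits`.

```rust
/// First divergence found between oracle and port posteriors.
#[derive(Debug, PartialEq, Eq)]
pub enum Drift {
    Clean,
    /// Different number of timesteps or classes.
    Shape,
    /// The winning class differs at this timestep.
    Argmax { step: usize },
    /// Same winner, but some posterior differs in its bit pattern.
    Bits { step: usize, class: usize },
}

/// Index of the largest value; the first maximum wins on ties.
fn argmax(row: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &p) in row.iter().enumerate() {
        if best.map_or(true, |(_, bp)| p > bp) {
            best = Some((i, p));
        }
    }
    best.map(|(i, _)| i)
}

/// Compare per-timestep argmax first, then the full posterior bit for bit.
pub fn compare_posteriors(oracle: &[Vec<f32>], port: &[Vec<f32>]) -> Drift {
    if oracle.len() != port.len() {
        return Drift::Shape;
    }
    for (step, (o, p)) in oracle.iter().zip(port).enumerate() {
        if o.len() != p.len() {
            return Drift::Shape;
        }
        if argmax(o) != argmax(p) {
            return Drift::Argmax { step };
        }
        if let Some(class) = o.iter().zip(p).position(|(a, b)| a.to_bits() != b.to_bits()) {
            return Drift::Bits { step, class };
        }
    }
    Drift::Clean
}
```

Separating `Argmax` from `Bits` distinguishes drift that already changes the decoded text from drift that is still invisible at the output but violates the 0 ULP contract.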

## 4. Ground-truth oracle

The `AdaWorldAPI/tesseract-rs` FFI fork (thin bindings, `src/{lib,page_seg_mode}.rs`)
is built **only** as the oracle: it runs real `libtesseract` to dump (a) per-matrix
weights, (b) per-timestep posteriors, (c) final decoded text for the same crops.
The Rust port is diffed against these. The oracle is a dev/test dependency, never a
runtime path, and the only place where the Leptonica C fork is compiled.

## 5. Deliverables

- **D-OCR-20:** forward pass on ndarray reproduces C++ per-timestep posteriors
(float path 1:1; int8 path within the D-OCR-22 contract) on a 1k-crop set.
- **D-OCR-21:** `recodebeam` + DAWG reproduces C++ decoded text byte-identical on
the same set.
- **D-OCR-22:** int8 conformance harness green on ≥ 10k crops across 2+ languages.

## 6. Open decisions

- **OD-20a:** target one SIMD width first (AVX2) for exactness, then NEON/AVX-512;
or define the scalar reference as canonical and treat SIMD as "must equal scalar"?
- **OD-21a:** support Tesseract's `lstm_choice_mode` (top-k per timestep) now — it
feeds the OCR `ValueTenant` top-k candidates (ocr-soa-integration OD-1) — or later?