|
| 1 | +# OCR → Canonical SoA Integration v1 |
| 2 | + |
| 3 | +> **Type:** plan (sub-plan — the one that binds OCR to the lance-graph substrate). Deliverables D-OCR-50/51/52/53. |
| 4 | +> **Status:** PLANTED 2026-06-15 — design only. THIS is "use the new architecture we raced for." |
| 5 | +> **Front:** post-#496. Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`). |
| 6 | +> **Canon anchors:** OGAR/CLAUDE.md P0 GUID; lance-graph/CLAUDE.md SoA node (`4ea6ac9`); soa-three-tier-model; DeepNSM crate (`lance-graph/crates/deepnsm`); helix/CAM-PQ (`crates/helix`, `bgz-tensor` CAM-PQ). |
| 7 | +> **Skip-by-rule:** OCR introduces NO bespoke row geometry. It rides the existing value-tenant carve. |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## 0. Intent |
| 12 | + |
| 13 | +An OCR token is not a foreign payload that needs a boundary adapter — it **is** a |
| 14 | +canonical SoA node. This plan defines the mapping so recognized text lands directly |
| 15 | +in the substrate: addressed by HHTL, classed by OGAR, valued by a `ValueSchema` |
| 16 | +preset over *existing* `ValueTenant`s, edged by `EdgeCodecFlavor`, repaired by |
| 17 | +DeepNSM + CAM/PQ, and persisted via `NodeRowPacket`. Zero boundary tax — the whole |
| 18 | +point of the splat-native / "one representation, many views" doctrine, applied to OCR. |
| 19 | + |
| 20 | +## 1. OCR token → `NodeRow` mapping (D-OCR-50) |
| 21 | + |
| 22 | +**Key (`NodeGuid`, 16 B):** `classid · HEEL · HIP · TWIG · family · identity`. |
| 23 | +- `classid` = the minted OCR class prefix (see §2). `0x0000_0000` fallback until OGAR mints it. |
| 24 | +- HHTL path (HEEL/HIP/TWIG) = document → page → block (the layout hierarchy from `ocrs::layout_analysis`). |
| 25 | +- `family` (3 B) = line/region basin; `identity` (3 B) = token ordinal within the basin. |
| 26 | + → `local_key()` (trailing 6 B) addresses a token within its line after the trie walk. |
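
The key layout above can be sketched as a byte-packing routine. This is a minimal illustration, not the real `NodeGuid` from `canonical_node.rs`: the type name `NodeGuidSketch`, the `pack` constructor, the field widths for `classid` (4 B) and HEEL/HIP/TWIG (2 B each), and the little-endian byte order are all assumptions chosen so the parts sum to 16 B. Only the 3 B `family`, 3 B `identity`, and the trailing 6 B `local_key()` come from the plan text.

```rust
/// Hypothetical stand-in for `NodeGuid`; widths other than family/identity are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeGuidSketch(pub [u8; 16]);

impl NodeGuidSketch {
    /// Packs classid(4) · HEEL(2) · HIP(2) · TWIG(2) · family(3) · identity(3), little-endian.
    /// Returns `None` if `family` or `identity` does not fit in 24 bits.
    pub fn pack(
        classid: u32,
        heel: u16, // document (assumed 2 B)
        hip: u16,  // page (assumed 2 B)
        twig: u16, // block (assumed 2 B)
        family: u32,   // line/region basin, 3 B
        identity: u32, // token ordinal within the basin, 3 B
    ) -> Option<Self> {
        if family > 0x00FF_FFFF || identity > 0x00FF_FFFF {
            return None;
        }
        let mut b = [0u8; 16];
        b[0..4].copy_from_slice(&classid.to_le_bytes());
        b[4..6].copy_from_slice(&heel.to_le_bytes());
        b[6..8].copy_from_slice(&hip.to_le_bytes());
        b[8..10].copy_from_slice(&twig.to_le_bytes());
        b[10..13].copy_from_slice(&family.to_le_bytes()[..3]);
        b[13..16].copy_from_slice(&identity.to_le_bytes()[..3]);
        Some(Self(b))
    }

    /// Trailing 6 B (family + identity): addresses a token within its line.
    pub fn local_key(&self) -> [u8; 6] {
        let mut k = [0u8; 6];
        k.copy_from_slice(&self.0[10..16]);
        k
    }
}
```

With the `0x0000_0000` fallback classid, `NodeGuidSketch::pack(0, doc, page, block, line, ordinal)` yields a key whose first four bytes stay zero until OGAR mints the class.
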
| 27 | + |
| 28 | +**Edges (`EdgeBlock`, 16 B = 12 in-family + 4 out-of-family):** |
| 29 | +- in-family (12): reading-order + local-layout adjacency (prev/next token, same-line |
| 30 | + neighbors, baseline siblings). `EdgeCodecFlavor::CoarseOnly` (1 B/slot) — pure topology. |
| 31 | +- out-of-family (4): inherited adapters — (A) table-cell membership, (B) block/column |
| 32 | + parent, (C) semantic/coref link (post-DeepNSM), (D) source-region (bbox → page geometry). |
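
A rough sketch of the 12 + 4 slot split under `CoarseOnly` follows. The names `EdgeBlockSketch`, `OutSlot`, `push_in_family`, and `set_out` are hypothetical, and so is the convention that a zero byte marks an empty slot; the plan only fixes the 16 B total, the 12/4 split, the 1 B/slot coarse code, and the four out-of-family roles (A) to (D).

```rust
/// Out-of-family adapter roles (A)..(D) from the plan; the index assignment is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(usize)]
pub enum OutSlot {
    TableCell = 0,    // (A) table-cell membership
    BlockParent = 1,  // (B) block/column parent
    SemanticLink = 2, // (C) semantic/coref link
    SourceRegion = 3, // (D) source region
}

/// Hypothetical stand-in for `EdgeBlock`: 12 in-family + 4 out-of-family slots, 1 B each.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct EdgeBlockSketch {
    pub in_family: [u8; 12],
    pub out_of_family: [u8; 4],
}

impl EdgeBlockSketch {
    /// Writes `code` into the first empty in-family slot (0 = empty, by assumption).
    /// Returns false if `code` is 0 or all 12 slots are taken.
    pub fn push_in_family(&mut self, code: u8) -> bool {
        if code == 0 {
            return false;
        }
        match self.in_family.iter_mut().find(|s| **s == 0) {
            Some(slot) => {
                *slot = code;
                true
            }
            None => false,
        }
    }

    /// Sets one of the four out-of-family adapter slots.
    pub fn set_out(&mut self, slot: OutSlot, code: u8) {
        self.out_of_family[slot as usize] = code;
    }

    /// Serializes to the 16-byte block: in-family first, then out-of-family.
    pub fn to_bytes(&self) -> [u8; 16] {
        let mut b = [0u8; 16];
        b[..12].copy_from_slice(&self.in_family);
        b[12..].copy_from_slice(&self.out_of_family);
        b
    }
}
```
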
| 33 | + |
| 34 | +## 2. OCR class + HHTL address scheme (D-OCR-50) |
| 35 | + |
| 36 | +- Mint an OCR class family in OGAR (`ogar-ontology`): `Document → Page → Block → |
| 37 | + Line → Token`, with leaf token subtypes (`Word`, `Number`, `Date`, `Currency`, |
| 38 | + `Glyph`, `TableCell`). Until OGAR mints them, hardcode the classid prefix space |
| 39 | + per the reserve-don't-reclaim ladder (the classid bytes stay reserved at offset 0). |
| 40 | +- `ClassView` for the OCR class declares `edge_codec_flavor` (`CoarseOnly`) and |
| 41 | + `value_schema` (the OCR preset, §3). |
| 42 | + |
| 43 | +## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51) |
| 44 | + |
| 45 | +The 480-byte value slab already carves into `VALUE_TENANTS`. An OCR token is **not |
| 46 | +a stored string and not a hash** — it is the *terminal of the perturbation cascade*, |
| 47 | +reconstructed exactly like every other node. Text = codebook index + residue. |
| 48 | + |
| 49 | +| Tenant (existing) | OCR role | |
| 50 | +|---|---| |
| 51 | +| helix residue = **centroid attention field** (NOT a stored code) | The 24-bit golden index is the **query↔centroid alignment** (φ-spiral direction = how this point attends to its place-centroid); the Morton-tile stacked-pyramid perturbation-shader is **multi-scale attention** (coarse centroid → fine perturbation = HHTL cascade in residue space). The field is **evaluated from the φ-template, never stored** ("8K resolution at Super-8 cost" — only the index is kept). Place=HHTL centroid; residue=perturbation off it. The 48-byte `ValueTenant::HelixResidue` is category-wrong (stores a field that must be computed) — do NOT use it. | |
| 52 | +| `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook | |
| 53 | +| `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV | |
| 54 | +| `EntityType` (u16) | token subtype (Word/Number/Date/Currency/Glyph/TableCell) | |
| 55 | +| `Plasticity` (u32) | correction history / last-repair stamp | |
| 56 | + |
| 57 | +**Reconstruction (this is the round-trip, and it answers Codex P1):** |
| 58 | +`text ⇄ codebook_index(Meta) + field-eval(helix 24-bit golden-index attention ⊕ TurbovecResidue PQ)`. Decode = |
| 59 | +the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to |
| 60 | +the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` / |
| 61 | +coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The |
| 62 | +reversibility lives in residue + codebook, which is the architecture's whole point. |
| 63 | + |
| 64 | +**True-OOV (no codebook neighbor — a raw code like `69B8`):** falls back to the |
| 65 | +**recoder-code residue** — `recodebeam` already emits recoder codes, not pixels, so |
| 66 | +the codes themselves are the reversible payload in `Meta`, repaired by the |
| 67 | +char-confusion grammar (D-OCR-52). Still a residue, never a hash. |
| 68 | + |
| 69 | +**ValueSchema:** `Cognitive` does NOT include `HelixResidue`/`TurbovecResidue`, so |
| 70 | +OCR needs a dedicated **`ValueSchema::Ocr`** = `FieldMask` over |
| 71 | +{`HelixResidue`,`TurbovecResidue`,`Meta`,`EntityType`,`Plasticity`}. Selection only; |
| 72 | +moves no tenant (canon: tenants never move/reuse). |
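
The "selection only" property can be illustrated with a bitmask sketch. The types `Tenant` and `FieldMaskSketch`, the bit positions, and the `ocr_mask` helper are hypothetical; the real `FieldMask` lives in `class_view.rs` and may be shaped differently. `Fingerprint` is included here only as an example of a tenant the OCR preset does not select.

```rust
/// Hypothetical tenant tags; bit positions are illustrative, not the real layout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Tenant {
    HelixResidue = 0,
    TurbovecResidue = 1,
    Meta = 2,
    EntityType = 3,
    Plasticity = 4,
    Fingerprint = 5, // assumed to exist; not selected by the OCR preset
}

/// Hypothetical stand-in for `FieldMask`: one bit per tenant, selection only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FieldMaskSketch(pub u32);

impl FieldMaskSketch {
    /// Builds a mask selecting the given tenants; no tenant is moved or resized.
    pub fn of(tenants: &[Tenant]) -> Self {
        Self(tenants.iter().fold(0u32, |m, &t| m | (1u32 << (t as u8))))
    }

    /// True if the mask selects tenant `t`.
    pub fn contains(self, t: Tenant) -> bool {
        (self.0 & (1u32 << (t as u8))) != 0
    }
}

/// The tenant set listed for `ValueSchema::Ocr` in the plan text.
pub fn ocr_mask() -> FieldMaskSketch {
    FieldMaskSketch::of(&[
        Tenant::HelixResidue,
        Tenant::TurbovecResidue,
        Tenant::Meta,
        Tenant::EntityType,
        Tenant::Plasticity,
    ])
}
```

Because the preset is only a mask, adding or dropping a tenant from the OCR view changes one bit and leaves the 480-byte slab carve untouched.
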
| 73 | + |
| 74 | +## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52) |
| 75 | + |
| 76 | +The recognizer emits candidates+confidence; repair is the brainstem we already have: |
| 77 | +- **Character/orthographic layer (new, thin, below DeepNSM):** `0/O 1/I/l 5/S rn/m` |
| 78 | + confusion table + number/date/currency/table-cell grammars. Repairs orthography on |
| 79 | + OOV garbage (codes, IDs like `69B8`) BEFORE the word layer. (This is the only |
| 80 | + genuinely greenfield code; the word-frequency half already exists as |
| 81 | + `deepnsm/word_frequency`.) |
| 82 | +- **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder` |
| 83 | + → `similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation. |
| 84 | +- **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue` |
| 85 | + (PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA (clustered-hierarchical outlier detection) flags anomalous |
| 86 | + tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`. |
| 87 | + |
| 88 | +Repaired token writes back: corrected codebook index → `Meta`, token subtype → |
| 89 | +`EntityType`, repair provenance → `Meta`/`Plasticity`. |
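
The character/orthographic layer can be sketched as single-substitution candidate generation checked against a vocabulary. This is a toy under stated assumptions: the function names `variants` and `repair` are hypothetical, the vocabulary is a plain `HashSet` rather than the DeepNSM `vocabulary`, and only one substitution is tried per candidate. The confusion pairs themselves are the ones listed above.

```rust
use std::collections::HashSet;

/// Confusion pairs from the plan: 0/O, 1/I/l, 5/S, rn/m. Each is tried in both directions.
const CONFUSIONS: &[(&str, &str)] = &[
    ("0", "O"),
    ("1", "I"),
    ("1", "l"),
    ("I", "l"),
    ("5", "S"),
    ("rn", "m"),
];

/// All strings reachable from `token` by exactly one confusion substitution.
pub fn variants(token: &str) -> Vec<String> {
    let mut out = Vec::new();
    for &(a, b) in CONFUSIONS {
        for (from, to) in [(a, b), (b, a)] {
            for (idx, _) in token.match_indices(from) {
                let mut v = String::with_capacity(token.len() + to.len());
                v.push_str(&token[..idx]);
                v.push_str(to);
                v.push_str(&token[idx + from.len()..]);
                out.push(v);
            }
        }
    }
    out
}

/// Returns the token unchanged if it is in the vocabulary, else the first
/// single-substitution variant that is; `None` means no in-vocabulary repair exists.
pub fn repair(token: &str, vocab: &HashSet<&str>) -> Option<String> {
    if vocab.contains(token) {
        return Some(token.to_string());
    }
    variants(token)
        .into_iter()
        .find(|v| vocab.contains(v.as_str()))
}
```

With a vocabulary containing `computer`, the input `cornputer` is repaired through the `rn`→`m` pair. A raw code such as `69B8` has no in-vocabulary variant and returns `None`, which is the case the recoder-code fallback in §3 is meant to cover.
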
| 90 | + |
| 91 | +## 5. Persistence + planner (kv-lance / surreal) |
| 92 | + |
| 93 | +- `NodeRowPacket` → `SoaEnvelope` → Lance (kv-lance backend, per `surrealdb` fork). |
| 94 | + OCR nodes are ordinary rows; a Lance version is a coherent page/document snapshot. |
| 95 | +- `surreal_container` as the **OCR-job control plane** (per its role: planner / AST |
| 96 | + adapter / time-series / kanban): kanban of OCR jobs (queued→detect→recognize→ |
| 97 | + repair→persisted via the Rubicon transitions already in `soa_view.rs`), time-series |
| 98 | + of throughput, AST API for the repair-grammar (compile-time vs JIT grammars). |
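
The job kanban stages can be sketched as a linear state machine. The enum `OcrJobStage` and its `next` method are hypothetical names for illustration; the actual Rubicon transitions in `soa_view.rs` are not reproduced here and may carry more states or guards.

```rust
/// Hypothetical OCR job stages, in the order given in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OcrJobStage {
    Queued,
    Detect,
    Recognize,
    Repair,
    Persisted,
}

impl OcrJobStage {
    /// The only forward transition from each stage; `Persisted` is terminal.
    pub fn next(self) -> Option<Self> {
        match self {
            OcrJobStage::Queued => Some(OcrJobStage::Detect),
            OcrJobStage::Detect => Some(OcrJobStage::Recognize),
            OcrJobStage::Recognize => Some(OcrJobStage::Repair),
            OcrJobStage::Repair => Some(OcrJobStage::Persisted),
            OcrJobStage::Persisted => None,
        }
    }
}
```
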
| 99 | + |
| 100 | +## 6. Bit-reproducibility harness (D-OCR-53) — the migration payoff |
| 101 | + |
| 102 | +The transcode oracle (D-OCR-2x) makes OCR a **deterministic regression source for |
| 103 | +the whole SoA migration**: the same line crop → C++ Tesseract text AND Rust port |
| 104 | +text AND the resulting `NodeRow` bytes. Because every stage is supposed to be |
| 105 | +bit-reproducible (DeepNSM bit-reproducible, envelope version-stamped, CausalEdge64 |
| 106 | +locked), a golden-file diff over (crop → NodeRow) exercises exactly the muscles the |
| 107 | +migration must harden: `ndarray::hpc` hydration, the envelope LE round-trip, and |
| 108 | +SIMD numeric exactness. OCR is the best external oracle the substrate has. |
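
The core of a golden-file diff over `NodeRow` bytes is locating the first divergent offset. A minimal sketch, assuming the golden and actual rows are already available as byte slices; the helper name `first_mismatch` is hypothetical and not part of any existing harness.

```rust
/// Returns the offset of the first differing byte between `golden` and `actual`,
/// or the shorter length if one is a strict prefix of the other.
/// `None` means the two byte strings are identical.
pub fn first_mismatch(golden: &[u8], actual: &[u8]) -> Option<usize> {
    if let Some(i) = golden.iter().zip(actual).position(|(g, a)| g != a) {
        return Some(i);
    }
    if golden.len() != actual.len() {
        Some(golden.len().min(actual.len()))
    } else {
        None
    }
}
```

Reporting an offset rather than a boolean lets a failure be mapped back to the key, edge block, or value tenant that drifted, once the row layout from D-OCR-50/51 is fixed.
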
| 109 | + |
| 110 | +## 7. Deliverables |
| 111 | + |
| 112 | +- **D-OCR-50:** OCR class + HHTL address scheme; `ClassView` impl for OCR class. |
| 113 | +- **D-OCR-51:** `ValueSchema` OCR preset (FieldMask over existing tenants); a token |
| 114 | + round-trips token→NodeRow→token with no geometry change. |
| 115 | +- **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known |
| 116 | + OCR-garbage fixture (`69B8`, `rn`→`m`) is repaired by plausibility. |
| 117 | +- **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the |
| 118 | + SoA migration suite. **Prereq: D-OCR-50 + D-OCR-51** (class/HHTL/ValueSchema must |
| 119 | + define the row layout before bytes can be golden-diffed). |
| 120 | + |
| 121 | +## 8. Open decisions |
| 122 | + |
| 123 | +- **OD-1:** dedicated `ValueTenant::OcrEvidence` vs ride `Meta`+`HelixResidue` (POC rides). |
| 124 | +- **OD-50a:** is a "Token" one node, or is a "Line" the node and tokens are value-slab |
| 125 | + sub-records? (Node-per-token is simpler + edges are natural; node-per-line is denser.) |
| 126 | +- **OD-52a:** character-confusion layer as a `deepnsm` submodule vs a sibling |
| 127 | + `coca-codebook` crate. (Word-frequency half already lives in `deepnsm/word_frequency`.) |