docs(plan): Tesseract → tesseract-rs 1:1 transcode (LSTM hosted via embedanything) — v2 #497
Merged
25 commits
d911a1c
docs(plan): plant ocr-canonical-soa-integration-v1.md
AdaWorldAPI 9725503
docs(plan): plant tesseract-rs-ast-dll-codegen-v1.md
AdaWorldAPI 1e24600
docs(plan): plant tesseract-rs-lstm-recodebeam-v1.md
AdaWorldAPI 09b0b4e
docs(plan): plant tesseract-rs-neural-layout-ocrs-v1.md
AdaWorldAPI 1c9736f
docs(plan): plant tesseract-rs-traineddata-ndarray-v1.md
AdaWorldAPI 6ddbd97
docs(plan): plant tesseract-rs-transcode-master-v1.md
AdaWorldAPI b021d2c
docs(plan): retire tesseract-rs-traineddata-ndarray-v1.md (v2 superse…
AdaWorldAPI 5d6b51c
docs(plan): retire tesseract-rs-lstm-recodebeam-v1.md (v2 supersession)
AdaWorldAPI 7f85b58
docs(plan): retire tesseract-rs-neural-layout-ocrs-v1.md (v2 superses…
AdaWorldAPI a324864
docs(plan): v2 — ocr-canonical-soa-integration-v1.md
AdaWorldAPI 357634e
docs(plan): v2 — tesseract-rs-ast-dll-codegen-v1.md
AdaWorldAPI 2dc636b
docs(plan): v2 — tesseract-rs-layout-transcode-v1.md
AdaWorldAPI 534a9c5
docs(plan): v2 — tesseract-rs-recodebeam-transcode-v1.md
AdaWorldAPI dbd6386
docs(plan): v2 — tesseract-rs-traineddata-gguf-v1.md
AdaWorldAPI eeb0e7f
docs(plan): v2 — tesseract-rs-transcode-master-v1.md
AdaWorldAPI 5b7ba97
docs(plan): residue-cascade text reconstruction (no hashes); gate D-O…
AdaWorldAPI aa0ef21
docs(plan): residue-cascade text reconstruction (no hashes); gate D-O…
AdaWorldAPI abdbf9d
docs(plan): residue-cascade text reconstruction (no hashes); gate D-O…
AdaWorldAPI c089518
docs(plan): helix residue = 48-bit (2x ResidueEdge), distinct from 48…
AdaWorldAPI 39a23de
docs(plan): helix residue = phi-spiral endpoint-pair edge (3B/24bit),…
AdaWorldAPI 4c46707
docs(plan): helix residue in BITS (24b/edge, 48b/token); disown bogus…
AdaWorldAPI df48c87
docs(plan): helix residue = 24-bit golden index (probe #495); helix-4…
AdaWorldAPI 209ea6a
docs(plan): centroid attention field synthesis — helix residue as fie…
AdaWorldAPI 6e316fa
docs(plan): centroid attention field synthesis — helix residue as fie…
AdaWorldAPI 298e8e9
docs(plan): aerial = lance-graph-arm-discovery (Aerial+ codebook-dist…
AdaWorldAPI
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,127 @@ | ||
| # OCR → Canonical SoA Integration v1 | ||
|
|
||
| > **Type:** plan (sub-plan — the one that binds OCR to the lance-graph substrate). Deliverables D-OCR-50/51/52/53. | ||
| > **Status:** PLANTED 2026-06-15 — design only. THIS is "use the new architecture we raced for." | ||
| > **Front:** post-#496. Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`). | ||
| > **Canon anchors:** OGAR/CLAUDE.md P0 GUID; lance-graph/CLAUDE.md SoA node (`4ea6ac9`); soa-three-tier-model; DeepNSM crate (`lance-graph/crates/deepnsm`); helix/CAM-PQ (`crates/helix`, `bgz-tensor` CAM-PQ). | ||
| > **Skip-by-rule:** OCR introduces NO bespoke row geometry. It rides the existing value-tenant carve. | ||
|
|
||
| --- | ||
|
|
||
| ## 0. Intent | ||
|
|
||
| An OCR token is not a foreign payload that needs a boundary adapter — it **is** a | ||
| canonical SoA node. This plan defines the mapping so recognized text lands directly | ||
| in the substrate: addressed by HHTL, classed by OGAR, valued by a `ValueSchema` | ||
| preset over *existing* `ValueTenant`s, edged by `EdgeCodecFlavor`, repaired by | ||
| DeepNSM + CAM/PQ, and persisted via `NodeRowPacket`. Zero boundary tax — the whole | ||
| point of the splat-native / "one representation, many views" doctrine, applied to OCR. | ||
|
|
||
| ## 1. OCR token → `NodeRow` mapping (D-OCR-50) | ||
|
|
||
| **Key (`NodeGuid`, 16 B):** `classid · HEEL · HIP · TWIG · family · identity`. | ||
| - `classid` = the minted OCR class prefix (see §2). `0x0000_0000` fallback until OGAR mints it. | ||
| - HHTL path (HEEL/HIP/TWIG) = document → page → block (the layout hierarchy from `ocrs::layout_analysis`). | ||
| - `family` (3 B) = line/region basin; `identity` (3 B) = token ordinal within the basin. | ||
| → `local_key()` (trailing 6 B) addresses a token within its line after the trie walk. | ||
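
A minimal sketch of the key layout described above, for illustration only. The plan fixes the 16 B total and the 3 B `family` / 3 B `identity` widths; the 4 B `classid` (suggested by the `0x0000_0000` fallback), the 2 B each for HEEL/HIP/TWIG, the little-endian packing, and the name `OcrNodeGuid` are assumptions, not the real `NodeGuid` in `canonical_node.rs`.

```rust
/// Hypothetical 16-byte OCR key. Byte ranges (assumed):
/// [0..4) classid, [4..6) HEEL, [6..8) HIP, [8..10) TWIG,
/// [10..13) family, [13..16) identity. All little-endian (assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OcrNodeGuid(pub [u8; 16]);

const MAX_U24: u32 = 0x00FF_FFFF;

impl OcrNodeGuid {
    /// Packs the fields; returns `None` if `family` or `identity` exceed 24 bits.
    pub fn pack(
        classid: u32,
        heel: u16,
        hip: u16,
        twig: u16,
        family: u32,
        identity: u32,
    ) -> Option<Self> {
        if family > MAX_U24 || identity > MAX_U24 {
            return None;
        }
        let mut b = [0u8; 16];
        b[0..4].copy_from_slice(&classid.to_le_bytes());
        b[4..6].copy_from_slice(&heel.to_le_bytes());
        b[6..8].copy_from_slice(&hip.to_le_bytes());
        b[8..10].copy_from_slice(&twig.to_le_bytes());
        b[10..13].copy_from_slice(&family.to_le_bytes()[..3]);
        b[13..16].copy_from_slice(&identity.to_le_bytes()[..3]);
        Some(Self(b))
    }

    /// Trailing 6 bytes: family (line/region basin) + identity (token ordinal).
    pub fn local_key(&self) -> [u8; 6] {
        let mut k = [0u8; 6];
        k.copy_from_slice(&self.0[10..16]);
        k
    }

    /// Token ordinal within its basin, widened back to `u32`.
    pub fn identity(&self) -> u32 {
        u32::from_le_bytes([self.0[13], self.0[14], self.0[15], 0])
    }
}
```

With this layout, `OcrNodeGuid::pack(0, doc, page, block, line, ordinal)` uses the fallback classid, and `local_key()` is what remains to compare once the HEEL/HIP/TWIG walk has reached the block.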
|
|
||
| **Edges (`EdgeBlock`, 16 B = 12 in-family + 4 out-of-family):** | ||
| - in-family (12): reading-order + local-layout adjacency (prev/next token, same-line | ||
| neighbors, baseline siblings). `EdgeCodecFlavor::CoarseOnly` (1 B/slot) — pure topology. | ||
| - out-of-family (4): inherited adapters — (A) table-cell membership, (B) block/column | ||
| parent, (C) semantic/coref link (post-DeepNSM), (D) source-region (bbox → page geometry). | ||
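
The slot split above can be sketched as a flat 16-byte block under the `CoarseOnly` assumption of one byte per slot. The type `OcrEdgeBlock`, the `OcrAdapter` enum, and the ordering of the four adapter slots are hypothetical names chosen for illustration; the real `EdgeBlock` may encode slots differently.

```rust
pub const IN_FAMILY_SLOTS: usize = 12;
pub const OUT_OF_FAMILY_SLOTS: usize = 4;

/// The four out-of-family adapters (A)-(D) from the plan; slot order is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OcrAdapter {
    TableCell = 0,
    BlockParent = 1,
    Semantic = 2,
    SourceRegion = 3,
}

/// Hypothetical 16-slot edge block at 1 byte per slot.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct OcrEdgeBlock {
    pub slots: [u8; IN_FAMILY_SLOTS + OUT_OF_FAMILY_SLOTS],
}

impl OcrEdgeBlock {
    /// Writes a coarse topology code into an in-family slot (0..12).
    /// Returns `false` if the slot index belongs to the out-of-family range.
    pub fn set_in_family(&mut self, slot: usize, code: u8) -> bool {
        if slot >= IN_FAMILY_SLOTS {
            return false;
        }
        self.slots[slot] = code;
        true
    }

    /// Writes one of the four out-of-family adapter slots.
    pub fn set_adapter(&mut self, adapter: OcrAdapter, code: u8) {
        self.slots[IN_FAMILY_SLOTS + adapter as usize] = code;
    }

    /// Reads an out-of-family adapter slot.
    pub fn adapter(&self, adapter: OcrAdapter) -> u8 {
        self.slots[IN_FAMILY_SLOTS + adapter as usize]
    }
}
```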
|
|
||
| ## 2. OCR class + HHTL address scheme (D-OCR-50) | ||
|
|
||
| - Mint an OCR class family in OGAR (`ogar-ontology`): `Document → Page → Block → | ||
| Line → Token`, with leaf token subtypes (`Word`, `Number`, `Date`, `Currency`, | ||
| `Glyph`, `TableCell`). Until OGAR mints them, hardcode the classid prefix space | ||
| per the reserve-don't-reclaim ladder (the classid bytes stay reserved at offset 0). | ||
| - `ClassView` for the OCR class declares `edge_codec_flavor` (`CoarseOnly`) and | ||
| `value_schema` (the OCR preset, §3). | ||
|
|
||
| ## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51) | ||
|
|
||
| The 480-byte value slab already carves into `VALUE_TENANTS`. An OCR token is **not | ||
| a stored string and not a hash** — it is the *terminal of the perturbation cascade*, | ||
| reconstructed exactly like every other node. Text = codebook index + residue. | ||
|
|
||
| | Tenant (existing) | OCR role | | ||
| |---|---| | ||
| | helix residue = **centroid attention field** (NOT a stored code) | The 24-bit golden index is the **query↔centroid alignment** (φ-spiral direction = how this point attends to its place-centroid); the Morton-tile stacked-pyramid perturbation-shader is **multi-scale attention** (coarse centroid → fine perturbation = HHTL cascade in residue space). The field is **evaluated from the φ-template, never stored** ("8K resolution at Super-8 cost" — only the index is kept). Place=HHTL centroid; residue=perturbation off it. The 48-byte `ValueTenant::HelixResidue` is category-wrong (stores a field that must be computed) — do NOT use it. | | ||
| | `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook | | ||
| | `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV | | ||
| | `EntityType` (u16) | token subtype (Word/Number/Date/Currency/Glyph/TableCell) | | ||
| | `Plasticity` (u32) | correction history / last-repair stamp | | ||
|
|
||
| **Reconstruction (this is the round-trip, and it answers Codex P1):** | ||
| `text ⇄ codebook_index(Meta) + field-eval(helix 24-bit golden-index attention ⊕ TurbovecResidue PQ)`. Decode = | ||
| the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to | ||
| the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` / | ||
| coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The | ||
| reversibility lives in residue + codebook, which is the architecture's whole point. | ||
|
|
||
| **True-OOV (no codebook neighbor — a raw code like `69B8`):** falls back to the | ||
| **recoder-code residue** — `recodebeam` already emits recoder codes, not pixels, so | ||
| the codes themselves are the reversible payload in `Meta`, repaired by the | ||
| char-confusion grammar (D-OCR-52). Still a residue, never a hash. | ||
|
|
||
| **ValueSchema:** `Cognitive` does NOT include `HelixResidue`/`TurbovecResidue`, so | ||
| OCR needs a dedicated **`ValueSchema::Ocr`** = `FieldMask` over | ||
| {`HelixResidue`,`TurbovecResidue`,`Meta`,`EntityType`,`Plasticity`}. Selection only; | ||
| moves no tenant (canon: tenants never move/reuse). | ||
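
To make "selection only; moves no tenant" concrete, here is a hedged sketch of a preset as a bitmask over tenant ids. The `Tenant` discriminants and this `FieldMask` representation are invented for the example and are not the definitions in `class_view.rs`; only the five selected tenant names come from the paragraph above.

```rust
/// Hypothetical tenant ids; the real `ValueTenant` discriminants are not shown in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Tenant {
    HelixResidue = 0,
    TurbovecResidue = 1,
    Meta = 2,
    EntityType = 3,
    Plasticity = 4,
    Fingerprint = 5,
}

/// One bit per tenant. A mask selects tenants; it never changes their offsets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FieldMask(pub u64);

impl FieldMask {
    pub const fn empty() -> Self {
        FieldMask(0)
    }

    pub const fn with(self, t: Tenant) -> Self {
        FieldMask(self.0 | (1u64 << (t as u8)))
    }

    pub const fn contains(self, t: Tenant) -> bool {
        self.0 & (1u64 << (t as u8)) != 0
    }
}

/// The OCR preset as listed above: five existing tenants, no `Fingerprint`.
pub const OCR_PRESET: FieldMask = FieldMask::empty()
    .with(Tenant::HelixResidue)
    .with(Tenant::TurbovecResidue)
    .with(Tenant::Meta)
    .with(Tenant::EntityType)
    .with(Tenant::Plasticity);
```

Because the preset is a `const` mask, adding the OCR schema changes no row geometry: a reader that ignores the mask still sees the same 480-byte slab.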
|
|
||
| ## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52) | ||
|
|
||
| The recognizer emits candidates+confidence; repair is the brainstem we already have: | ||
| - **Character/orthographic layer (new, thin, below DeepNSM):** `0/O 1/I/l 5/S rn/m` | ||
| confusion table + number/date/currency/table-cell grammars. Repairs orthography on | ||
| OOV garbage (codes, IDs like `69B8`) BEFORE the word layer. (This is the only | ||
| genuinely greenfield code; the word-frequency half already exists as | ||
| `deepnsm/word_frequency`.) | ||
| - **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder` | ||
| → `similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation. | ||
| - **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue` | ||
| (PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms) flags anomalous | ||
| tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`. | ||
|
|
||
| Repaired token writes back: corrected codebook index → `Meta`, subtype → `EntityType`, | ||
| repair provenance → `Meta`/`Plasticity`. No `Fingerprint` write (§3: never a hash). | ||
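
The character/orthographic layer can be illustrated with a small confusion-folding sketch. It is a stand-in, not the planned implementation: it folds the `0/O 1/I/l 5/S rn/m` confusions into one canonical form and returns every vocabulary word with the same folded form, leaving plausibility ranking to the word layer. The function names and the lowercase-first policy are assumptions.

```rust
/// Folds a token into a confusion-insensitive key.
/// Assumed policy: lowercase first, collapse the digraph "rn" to "m",
/// then map 0->o, 1/i->l, 5->s.
fn fold(s: &str) -> String {
    s.to_lowercase()
        .replace("rn", "m")
        .chars()
        .map(|c| match c {
            '0' => 'o',
            '1' | 'i' => 'l',
            '5' => 's',
            other => other,
        })
        .collect()
}

/// Returns every vocabulary word that is indistinguishable from `token`
/// under the confusion table. An empty result means "no orthographic
/// neighbor", i.e. the true-OOV path (recoder-code fallback).
pub fn candidates<'a>(token: &str, vocab: &[&'a str]) -> Vec<&'a str> {
    let key = fold(token);
    vocab.iter().copied().filter(|w| fold(w) == key).collect()
}
```

Note that folding is deliberately lossy: `modern` and `modem` share a key, so a vocabulary containing both yields two candidates and the word layer has to pick one. A raw code such as `69B8` folds to nothing in an ordinary word list and therefore yields no candidates.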
|
|
||
| ## 5. Persistence + planner (kv-lance / surreal) | ||
|
|
||
| - `NodeRowPacket` → `SoaEnvelope` → Lance (kv-lance backend, per `surrealdb` fork). | ||
| OCR nodes are ordinary rows; a Lance version is a coherent page/document snapshot. | ||
| - `surreal_container` as the **OCR-job control plane** (per its role: planner / AST | ||
| adapter / time-series / kanban): kanban of OCR jobs (queued→detect→recognize→ | ||
| repair→persisted via the Rubicon transitions already in `soa_view.rs`), time-series | ||
| of throughput, AST API for the repair-grammar (compile-time vs JIT grammars). | ||
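
The kanban progression named above (queued→detect→recognize→repair→persisted) can be sketched as a linear state machine. `OcrJobStage` and `remaining_path` are hypothetical names; this does not reproduce the Rubicon transitions in `soa_view.rs`, whose API is not shown in this plan.

```rust
/// Hypothetical OCR job stages, in the order given by the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OcrJobStage {
    Queued,
    Detect,
    Recognize,
    Repair,
    Persisted,
}

impl OcrJobStage {
    /// The only legal forward transition; `Persisted` is terminal.
    pub fn next(self) -> Option<Self> {
        match self {
            OcrJobStage::Queued => Some(OcrJobStage::Detect),
            OcrJobStage::Detect => Some(OcrJobStage::Recognize),
            OcrJobStage::Recognize => Some(OcrJobStage::Repair),
            OcrJobStage::Repair => Some(OcrJobStage::Persisted),
            OcrJobStage::Persisted => None,
        }
    }
}

/// Lists the stages a job still has to pass through, starting at `from`.
pub fn remaining_path(from: OcrJobStage) -> Vec<OcrJobStage> {
    let mut path = vec![from];
    let mut cur = from;
    while let Some(n) = cur.next() {
        path.push(n);
        cur = n;
    }
    path
}
```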
|
|
||
| ## 6. Bit-reproducibility harness (D-OCR-53) — the migration payoff | ||
|
|
||
| The transcode oracle (D-OCR-2x) makes OCR a **deterministic regression source for | ||
| the whole SoA migration**: the same line crop → C++ Tesseract text AND Rust port | ||
| text AND the resulting `NodeRow` bytes. Because every stage is supposed to be | ||
| bit-reproducible (DeepNSM bit-reproducible, envelope version-stamped, CausalEdge64 | ||
| locked), a golden-file diff over (crop → NodeRow) exercises exactly the muscles the | ||
| migration must harden: `ndarray::hpc` hydration, the envelope LE round-trip, and | ||
| SIMD numeric exactness. OCR is the best external oracle the substrate has. | ||
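
A golden-file comparison over `NodeRow` bytes could look like the following sketch. It assumes the harness has already produced the row bytes and loaded the golden bytes; `GoldenDiff` and `golden_diff` are illustrative names, not part of the existing migration suite.

```rust
/// Outcome of comparing produced bytes against a golden file.
#[derive(Debug, PartialEq, Eq)]
pub enum GoldenDiff {
    Match,
    LengthMismatch { actual: usize, golden: usize },
    ByteMismatch { offset: usize, actual: u8, golden: u8 },
}

/// Byte-exact comparison. Reports the first differing offset if the common
/// prefix differs, otherwise a length mismatch, otherwise `Match`.
pub fn golden_diff(actual: &[u8], golden: &[u8]) -> GoldenDiff {
    if let Some(offset) = actual.iter().zip(golden).position(|(a, g)| a != g) {
        return GoldenDiff::ByteMismatch {
            offset,
            actual: actual[offset],
            golden: golden[offset],
        };
    }
    if actual.len() != golden.len() {
        return GoldenDiff::LengthMismatch {
            actual: actual.len(),
            golden: golden.len(),
        };
    }
    GoldenDiff::Match
}
```

Reporting the first differing offset rather than a bare pass/fail lets a failure be traced back to the tenant or key field that occupies that byte range.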
|
|
||
| ## 7. Deliverables | ||
|
|
||
| - **D-OCR-50:** OCR class + HHTL address scheme; `ClassView` impl for OCR class. | ||
| - **D-OCR-51:** `ValueSchema` OCR preset (FieldMask over existing tenants); a token | ||
| round-trips token→NodeRow→token with no geometry change. | ||
| - **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known | ||
| OCR-garbage fixture (`69B8`, `rn`→`m`) is repaired by plausibility. | ||
| - **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the | ||
| SoA migration suite. **Prereq: D-OCR-50 + D-OCR-51** (class/HHTL/ValueSchema must | ||
| define the row layout before bytes can be golden-diffed). | ||
|
|
||
| ## 8. Open decisions | ||
|
|
||
| - **OD-1:** dedicated `ValueTenant::OcrEvidence` vs ride `Meta`+`HelixResidue` (POC rides). | ||
| - **OD-50a:** is a "Token" one node, or is a "Line" the node and tokens are value-slab | ||
| sub-records? (Node-per-token is simpler + edges are natural; node-per-line is denser.) | ||
| - **OD-52a:** character-confusion layer as a `deepnsm` submodule vs a sibling | ||
| `coca-codebook` crate. (Word-frequency half already lives in `deepnsm/word_frequency`.) | ||
69 changes: 69 additions & 0 deletions
.claude/plans/soa-centroid-attention-field-synthesis-v1.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| # SoA Centroid Attention Field — Unified Synthesis v1 | ||
|
|
||
| > **Type:** plan (phase-2 marker / co-architecture). Unifies recognition + reasoning + grammar as reads of ONE field. | ||
| > **Status:** PLANTED 2026-06-15. Gated on `cycle-coherent-soa-snapshot-v1` (plastic field ⇒ COW writes). | ||
| > **Canon:** helix crate (golden-index residue, φ-template); deepnsm; causal-edge (pearl/nars); TEKAMOLO (#495). | ||
|
|
||
| --- | ||
|
|
||
| ## 0. The one idea | ||
|
|
||
| The **48-bit helix residue + Morton-tile stacked-pyramid perturbation-shader IS a | ||
| centroid attention field.** Place (HHTL) = centroid; residue (24-bit golden index) | ||
| = each point's perturbation off it = the **query↔key alignment**; the pyramid = | ||
| **multi-scale attention** (coarse centroid → fine). The field is *evaluated from the | ||
| φ-spiral template, never stored*. Everything below is a **read of this one field at | ||
| a different scale** — not separate engines bolted together. | ||
|
|
||
| ## 1. The reads (each is the same field, different scale) | ||
|
|
||
| | Capability | Real crate / source | What it is, as a field read | | ||
| |---|---|---| | ||
| | **Perception (ONNX/LSTM)** | embedanything(candle)/GGUF host | emits a **query** into the field (golden index + posteriors); the ONLY learned-perceptual part, stays hosted | | ||
| | **Attention eval** | `helix` (golden index, curve-ruler, `DistanceLut`) | query↔centroid alignment; Morton pyramid = coarse→fine resolution | | ||
| | **Markov context building / bundling** | `deepnsm::markov_bundle`, `encoder` | temporal **superposition along the field** = the bundling read (context = bundled perturbations) | | ||
| | **Quorum + NARS reasoning** | `causal-edge::{pearl,nars,syllogism}` | centroid **coupling** = edge read; quorum = agreement of multiple field reads; NARS truth = coupling strength | | ||
| | **Grammar heuristics** | `deepnsm::{parser,pos,morphology,spo,syllogism}` | syntactic **field masks** = structured attention over the field | | ||
| | **Relative-pronoun / syntax order** | TEKAMOLO resolver (#495) | resolves adverbial/relative-pronoun binding = constrained attention path | | ||
| | **Rule learning (the real "aerial")** | `lance-graph-arm-discovery::aerial` — Aerial+ transcode (arXiv 2504.19354), **autoencoder replaced by integer codebook-distance oracle** (palette256, ρ=0.9973 vs cosine) | mines SPO association rules **float-free / bitwise-deterministic** → `arm_to_truth_u8` → `CausalEdge64` confidence_u8 + i4 mantissa. This IS "learning edges" — the field's codebook distance replaces the f32 autoencoder. | | ||
| | **Episodic / coref** | AriGraph (`EpisodicWitness64`) | temporal chain read = the field over witness-time | | ||
| | **Nearest-valid-token** | `crystal_neighborhood`, `cam64`, CAKES + `turbovec` | field-alignment argmax = read-off to codebook word | | ||
|
|
||
| ## 2. Why this is one object, not a pipeline | ||
|
|
||
| VSA bind/bundle/similarity **are** the field operations: bind = perturbation off | ||
| centroid, bundle = the pyramid's coarse-level superposition, similarity = field | ||
| alignment (`DistanceLut`). So DeepNSM's markov_bundle is the *symbolic readout* of | ||
| the field; NARS/quorum is the *edge coupling*; grammar/TEKAMOLO are *attention | ||
| masks*. No separate learning machine is needed — the attention field already does | ||
| binding/bundling/attention in one structure (Frady/Kleyko 1707.01429: trained-RNN | ||
| ⊁ VSA for symbol sequences). **`aerial` is the proof in-tree:** Aerial+'s f32 | ||
| autoencoder is replaced by the integer codebook-distance oracle (the field) and | ||
| still mines rules — neurosymbolic learning with NO autoencoder, NO SGD, NO seed. | ||
| What's missing is only **plasticity** (centroid drift), not a learner. | ||
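
For readers unfamiliar with the three VSA operations named above, here is a generic binary-VSA sketch on 64-bit words: XOR bind, per-bit majority bundle, and Hamming-based similarity. It is a textbook stand-in chosen for brevity, not the `helix` field evaluation or `DistanceLut`.

```rust
/// Bind: XOR. Self-inverse, so binding twice with the same key recovers the input.
pub fn bind(a: u64, b: u64) -> u64 {
    a ^ b
}

/// Bundle: per-bit strict majority vote over the inputs (superposition).
/// Ties (possible with an even count) resolve to 0.
pub fn bundle(vectors: &[u64]) -> u64 {
    let mut out = 0u64;
    for bit in 0..64 {
        let ones = vectors.iter().filter(|v| (**v >> bit) & 1 == 1).count();
        if ones * 2 > vectors.len() {
            out |= 1u64 << bit;
        }
    }
    out
}

/// Similarity: number of agreeing bits (64 minus Hamming distance).
pub fn similarity(a: u64, b: u64) -> u32 {
    64 - (a ^ b).count_ones()
}
```

All three are integer-only and deterministic, which is the property the surrounding argument relies on: no floating point, no training step, no seed.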
|
|
||
| ## 3. Phase-2: make the field plastic (the "learning edges") | ||
|
|
||
| Not new tenants — **the field adapts**: | ||
| - centroid **drift** (place-centroids move toward corpus density); | ||
| - shader **perturbation-gain** adaptation (the pyramid's response sharpens); | ||
| - timed by `Plasticity` tenant; coupled by `CausalEdge64` strength (NARS mantissa moves). | ||
| Evaluated from the φ-template (not materialized). **Hard dep:** `cycle-coherent-soa-snapshot` | ||
| COW — plastic field mutates per cycle; without snapshot it thrashes Lance. | ||
|
|
||
| ## 4. ONNX combination (operator's point) | ||
|
|
||
| The ONNX-shaped recognizer and the field **meet at the query boundary**: ONNX emits | ||
| posteriors → the field's golden-index query; field eval + grammar masks + NARS | ||
| coupling resolve to the token. So ONNX = the perceptual *encoder into* the field; | ||
| the field = everything symbolic/sequential/relational. One substrate, two scales. | ||
|
|
||
| ## 5. Determinism split (non-negotiable) | ||
| - **Frozen mode** (centroids/gains fixed) → bit-reproducible → the Tesseract oracle + golden-file harness run here. | ||
| - **Plastic mode** (field adapts) → live use; NOT golden-diffable; gated by snapshot. | ||
| Two modes, explicitly separated, or the bit-repro guarantee is lost. | ||
|
|
||
| ## 6. Open | ||
| - **OD-A:** RESOLVED — "aerial" = `lance-graph-arm-discovery::aerial` (Aerial+ rule-mining, codebook-distance oracle, not AriGraph). | ||
| - **OD-B:** centroid drift rule — Hebbian on `Plasticity`, or NARS-revision on `CausalEdge64`? (probe-gate, measure first.) | ||
| - **OD-C:** operator sign-off required for any new tenant (anti-invention guardrail) — phase-2 should need NONE (field is evaluated, not stored). |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| # tesseract-rs — AST-DLL C++→Rust Codegen Harness v1 | ||
|
|
||
| > **Type:** plan (sub-plan). Deliverables D-OCR-40/41/42. The transcode *mechanism*. | ||
| > **Status:** PLANTED 2026-06-15 v2 — layout IS in scope (1:1 raw-pointer), not skipped. | ||
| > **Front:** post-#496. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine. | ||
| > **Canon anchors:** master §4. Deterministic + diff-gated (bit-reproducibility doctrine). | ||
| > **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is transcribed faithfully as raw-pointer Rust (1:1), with safe-refactor deferred to a later oracle-gated pass. | ||
|
|
||
| --- | ||
|
|
||
| ## 0. Intent | ||
|
|
||
| Transcode the *mechanical* C++ leaf modules (container parse, unicharset, recoder, | ||
| dawg node-arrays, weight-matrix struct walks) into Rust by a **deterministic, | ||
| reviewable codegen harness** rather than by hand — so the faithful tier is | ||
| auditable and re-runnable. The harness pairs a **clang C++ AST frontend** with a | ||
| **Rust emission backend built on the `ruff` AST/codegen crates**. | ||
|
|
||
| ## 1. Why ruff (honest scoping) | ||
|
|
||
| `ruff` is a *Python* toolchain — `ruff_python_parser` / `ruff_python_ast` parse | ||
| **Python**, not C++. So ruff is **not** the C++ frontend. Its value here is the | ||
| mature, battle-tested **Rust-side AST → source emission discipline**: | ||
| `ruff_python_codegen` (AST → formatted source), `ruff_formatter` (the formatting | ||
| IR), `ruff_source_file`, and the `ruff_python_dto_check` pattern (structural | ||
| invariant checks on a typed AST). We reuse those *patterns and crates* as the | ||
| emission/formatting backend for a `RustAst → rust source` pipeline. The C++ side is | ||
| clang. | ||
|
|
||
| ``` | ||
| C++ source ──(libclang)──► Clang AST ──► [AST DLL: stable IR dump] ──► RustAst builder | ||
| ──► (ruff codegen/formatter discipline) ──► formatted .rs ──► diff-gate vs FFI oracle | ||
| ``` | ||
|
|
||
| ## 2. The "AST DLL" — D-OCR-40 | ||
|
|
||
| The C++ AST is extracted once into a **stable, serializable IR** (the "AST DLL"): | ||
| a libclang traversal that dumps the subset we transcode (struct/enum decls, plain | ||
| methods, table initializers, fixed-size array walks) as a typed IR — independent of | ||
| clang version drift, so the emission step is reproducible. Functions touching | ||
| pointers-into-mutable-graphs, virtual dispatch, or template metaprogramming are | ||
| **flagged NOT-CODEGENABLE** and routed to hand-port or to the faithful 1:1 raw-pointer | ||
| transcription (layout code is in scope under v2, D-OCR-30; see §5). | ||
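
The NOT-CODEGENABLE flagging can be sketched as a predicate over features observed on a declaration in the IR. The `CppFeature` and `Route` enums are hypothetical; the real IR schema is the subject of D-OCR-40 and is not defined in this plan.

```rust
/// Features the IR dump might record per declaration (illustrative set).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CppFeature {
    StructDecl,
    EnumDecl,
    PlainMethod,
    TableInitializer,
    FixedArrayWalk,
    MutableGraphPointer,
    VirtualDispatch,
    TemplateMetaprogramming,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Codegen,
    NotCodegenable,
}

/// A declaration is a codegen target only if none of its features is one of
/// the three blockers named in the plan.
pub fn route(features: &[CppFeature]) -> Route {
    let blocked = features.iter().any(|f| {
        matches!(
            f,
            CppFeature::MutableGraphPointer
                | CppFeature::VirtualDispatch
                | CppFeature::TemplateMetaprogramming
        )
    });
    if blocked {
        Route::NotCodegenable
    } else {
        Route::Codegen
    }
}
```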
|
|
||
| ## 3. Rust emission via ruff crates — D-OCR-41 | ||
|
|
||
| A `RustAst` builder consumes the IR and emits idiomatic Rust: | ||
| - field-by-field struct/enum transcription (canon: byte layout preserved); | ||
| - table/array initializers → `const`/`static` Rust tables; | ||
| - emission goes through ruff's formatter IR so output is deterministic and | ||
| diff-stable (re-running codegen produces byte-identical source); | ||
| - a `dto_check`-style pass asserts the **LE byte contract** is preserved per struct | ||
| (no silent re-ordering / re-widening — the same invariant the SoA envelope audit | ||
| enforces). | ||
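
As a toy illustration of deterministic emission, the sketch below turns a table initializer into `const` Rust source using plain `format!`. It stands in for the formatter-IR path and does not call any ruff crate; `emit_const_table` is a hypothetical helper. The point is only that the output depends on nothing but the inputs, so a re-run is byte-identical.

```rust
/// Emits a `pub const` table as Rust source text.
/// Fixed-width hex and fixed indentation keep the output diff-stable.
pub fn emit_const_table(name: &str, values: &[u32]) -> String {
    let mut out = format!("pub const {}: [u32; {}] = [\n", name, values.len());
    for v in values {
        out.push_str(&format!("    {:#010x},\n", v));
    }
    out.push_str("];\n");
    out
}
```

Emitting one element per line costs vertical space but means a changed table entry shows up as a one-line diff in the committed source.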
|
|
||
| ## 4. Diff-gate — D-OCR-42 | ||
|
|
||
| Every codegen'd module is validated against the FFI oracle: | ||
| - behavioral: emitted Rust function vs `libtesseract` function on the same inputs | ||
| (e.g. unicharset id↔utf8, recoder encode/decode, dawg word-membership) → byte-equal; | ||
| - structural: `dto_check` confirms each emitted struct's byte image matches the C++ | ||
| `sizeof`/offset dump. | ||
| Codegen output is committed (not generated at build) so reviewers see real Rust; | ||
| the harness is re-runnable to prove the commit equals the generator output. | ||
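
The structural half of the diff-gate can be sketched with `size_of` and `offset_of!` (stable since Rust 1.77) on a `#[repr(C)]` struct, compared against a size/offset dump from the C++ side. `DawgEdgeRecord` and its fields are invented for the example and do not mirror a real Tesseract struct; the expected values would come from the C++ `sizeof`/offset dump, which is assumed to exist.

```rust
use std::mem::{offset_of, size_of};

/// Hypothetical emitted struct; `repr(C)` pins field order and padding rules.
#[repr(C)]
pub struct DawgEdgeRecord {
    pub next_node: u32,
    pub unichar_id: u16,
    pub flags: u8,
    pub reserved: u8,
}

/// Size and per-field offsets as the Rust compiler lays them out.
pub fn rust_layout() -> (usize, Vec<(&'static str, usize)>) {
    (
        size_of::<DawgEdgeRecord>(),
        vec![
            ("next_node", offset_of!(DawgEdgeRecord, next_node)),
            ("unichar_id", offset_of!(DawgEdgeRecord, unichar_id)),
            ("flags", offset_of!(DawgEdgeRecord, flags)),
            ("reserved", offset_of!(DawgEdgeRecord, reserved)),
        ],
    )
}

/// Compares the Rust layout with the C++ dump; the first disagreement is an error.
pub fn check_layout(cpp_size: usize, cpp_offsets: &[(&str, usize)]) -> Result<(), String> {
    let (size, offsets) = rust_layout();
    if size != cpp_size {
        return Err(format!("size: rust {size} != c++ {cpp_size}"));
    }
    for (name, cpp_off) in cpp_offsets {
        match offsets.iter().find(|(n, _)| n == name) {
            Some((_, off)) if off == cpp_off => {}
            Some((_, off)) => {
                return Err(format!("{name}: rust offset {off} != c++ offset {cpp_off}"))
            }
            None => return Err(format!("{name}: no such field in the Rust struct")),
        }
    }
    Ok(())
}
```

A check of this shape catches silent re-ordering or re-widening at test time, before any behavioral comparison against `libtesseract` runs.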
|
|
||
| ## 5. Module assignment (codegen vs hand vs replace) | ||
|
|
||
| | C++ area | Route | | ||
| |---|---| | ||
| | `tessdatamanager`, `unicharset`, `unicharcompress` (recoder), `dawg`/`trie` node arrays, `weightmatrix` struct/quant walks | **CODEGEN (D-OCR-41)** | | ||
| | `recodebeam` (beam + dawg interaction), int8 GEMV rounding | **HAND-PORT** (numeric/behavioral subtlety) | | ||
| | `textord`/`ccstruct` layout | **CODEGEN → faithful raw-pointer Rust (D-OCR-30)** — intrusive ELIST/CLIST transcribed 1:1, NOT replaced | | ||
| | Leptonica (~dozen ops only) | hand-port to image/imageproc (D-OCR-31) | | ||
|
|
||
| ## 6. Deliverables | ||
|
|
||
| - **D-OCR-40:** libclang → stable IR dump for the codegen-target module set; NOT-CODEGENABLE flagging works. | ||
| - **D-OCR-41:** IR → committed Rust via ruff emission; re-run is byte-identical. | ||
| - **D-OCR-42:** behavioral + structural diff-gate green for the target modules vs the FFI oracle. | ||
|
|
||
| ## 7. Open decisions | ||
|
|
||
| - **OD-3 (from master):** libclang in-process vs clang `-Xclang -ast-dump=json` consumed by | ||
| a Rust IR. JSON is simpler/decoupled; libclang is richer/faster. Default: clang | ||
| JSON dump for v1 (decoupled, reproducible), libclang later if needed. | ||
| - **OD-40a:** is the AST-DLL harness OCR-specific, or a reusable | ||
| `AdaWorldAPI/<cpp-transcode>` tool? (It would also serve other C++→Rust ports.) |