Skip to content

Commit f880141

Browse files
committed
docs(board): revert tesseract-rs soa wiring — pure-Rust OCR is the goal, not C bindings
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
1 parent 74749bc commit f880141

1 file changed

Lines changed: 3 additions & 1 deletion

File tree

.claude/board/LATEST_STATE.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,9 @@
1010
1111
---
1212

13-
> **2026-06-15 — cross-repo landed****tesseract-rs fork wired to the transcode.** `AdaWorldAPI/tesseract-rs` branch `claude/wonderful-hawking-lodtql` commit `1687c718`: opt-in `soa` feature (default-OFF — standalone OCR build untouched) + `src/soa.rs::tsv_to_nodes(tsv, classid, min_conf) -> Vec<NodeRow>` parsing tesseract `get_tsv_text` word rows → `contract::ocr::LayoutBlock``to_node_row`. Contract dep is a path dep mirroring smb-office-rs (sibling checkout). **Edition-2015 compatible** (the fork has no `edition` field → 2015: root `extern crate` + submodule root-relative `use` + explicit `TryInto` — all caught + fixed by verifying in a 2015 scratch crate against the real contract before pushing, 2 tests green). Pushed via `GH_TOKEN`+pygithub (out-of-MCP-scope fork). Could NOT compile the full crate here (no tesseract C-lib) — the transcode LOGIC is what's verified; the fork's own CI needs a co-located lance-graph for `--features soa`.
13+
> **2026-06-15 — REVERTED (operator)** — the tesseract-rs `soa` wiring below was **deleted** (branch reset to master `420de08`). Operator: *"we don't want to use original Tesseract, we want to transcode it into Rust — delete everything you copied from original Tesseract into tesseract-rs."* Wrapping the original Tesseract C engine + parsing its TSV is the wrong direction; the real goal is a **pure-Rust OCR**. The contract-side transcode (`LayoutBlock::to_node_row`) + keystone STAY — they are OCR-engine-agnostic (a pure-Rust OCR feeds the same `LayoutBlock``NodeRow`); only the original-Tesseract coupling was removed. The strike-through entry below is retained per APPEND-ONLY.
14+
>
15+
> ~~**2026-06-15 — cross-repo landed****tesseract-rs fork wired to the transcode.**~~ *(REVERTED — see above)* `AdaWorldAPI/tesseract-rs` branch `claude/wonderful-hawking-lodtql` commit `1687c718`: opt-in `soa` feature (default-OFF — standalone OCR build untouched) + `src/soa.rs::tsv_to_nodes(tsv, classid, min_conf) -> Vec<NodeRow>` parsing tesseract `get_tsv_text` word rows → `contract::ocr::LayoutBlock``to_node_row`. Contract dep is a path dep mirroring smb-office-rs (sibling checkout). **Edition-2015 compatible** (the fork has no `edition` field → 2015: root `extern crate` + submodule root-relative `use` + explicit `TryInto` — all caught + fixed by verifying in a 2015 scratch crate against the real contract before pushing, 2 tests green). Pushed via `GH_TOKEN`+pygithub (out-of-MCP-scope fork). Could NOT compile the full crate here (no tesseract C-lib) — the transcode LOGIC is what's verified; the fork's own CI needs a co-located lance-graph for `--features soa`.
1416
>
1517
> **2026-06-15 — branch work (post-#496)** — **tesseract OCR → NodeRow transcode POC (keystone payoff).** `lance_graph_contract::ocr::LayoutBlock::to_node_row(classid, identity) -> NodeRow` — the reference transcode any `OcrProvider` (tesseract-rs + others) reuses, the keystone end-to-end: `classid → classid_read_mode → ValueSchema` gates WHICH tenants land; `BlockKind::entity_type() -> u16` → `ValueTenant::EntityType`, `confidence: f32` → `ValueTenant::Energy`, each written at its canon slab offset via the new `ValueTenant::{value_offset(), byte_len()}` (derived accessors over the locked carve — not new properties). **`text`/`bbox` are NOT bundled** (`I-VSA-IDENTITIES`: node = identity + typed scalars; the string + pixel geometry live in an external content store keyed by `identity`). Schema-gated (`schema.has(t)` before each write) so a Bootstrap-resolving class writes an empty slab; transcoded rows ride the `SoaEnvelope` zero-copy (verified). §0 anti-invention: reuses the existing EntityType/Energy tenants, no "ocr_kind" field. +4 tests; **623 contract lib green; clippy `-D warnings` + fmt clean.** Lives in the contract (next to the `ocr` types it uses, zero-dep, testable here — no OCR C-lib, no fork); tesseract-rs just adds the contract dep + calls it (integration step). Branch, not yet a PR.
1618
>

0 commit comments

Comments
 (0)