docs(plan): Tesseract → tesseract-rs 1:1 transcode (LSTM hosted via embedanything) — v2#497

Merged
AdaWorldAPI merged 25 commits into main from plan/tesseract-rs-transcode on Jun 16, 2026

AdaWorldAPI (Owner) commented Jun 15, 2026

v2 — corrected. 1:1 behavioral transcode of ALL Tesseract (layout included); the LSTM forward is the ONLY swapped component, HOSTED on the existing runbook (.traineddata → GGUF → embedanything DTO/candle → ndarray AMX, bgz_tensor store, per .grok/NDARRAY_BGZ_EMBEDANYTHING_INTEGRATION.md).

Plans

  • tesseract-rs-transcode-master-v1 (v2) — 1:1 everything; LSTM-only swap; D-OCR-NN index + DAG.
  • tesseract-rs-traineddata-gguf-v1: .traineddata → GGUF → embedanything(candle) host; bgz_tensor weight store.
  • tesseract-rs-layout-transcode-v1: textord/ccstruct 1:1, raw-pointer faithful (the ~200k LOC bulk); minimal Leptonica ops.
  • tesseract-rs-recodebeam-transcode-v1 — decoder transcoded over HOSTED posteriors.
  • tesseract-rs-ast-dll-codegen-v1 (v2) — clang→IR→Rust via ruff; layout now in-scope (raw-pointer), not replaced.
  • ocr-canonical-soa-integration-v1 — OCR token = canonical NodeRow + DeepNSM/CAM-PQ repair.

Corrections vs v1

  • Layout is 1:1 transcoded, not replaced by ocrs (v1 wrongly skipped it).
  • LSTM is hosted (GGUF→embedanything→ndarray), not kernel-transcoded.
  • unsafe/raw-pointer is the accepted faithful image of intrusive C++; safe-refactor is a later oracle-gated pass.
  • One DTO extension required: infer_sequence → [T,C] per-timestep posteriors (D-OCR-15).

Retired v1 plans (traineddata-ndarray, lstm-recodebeam, neural-layout-ocrs) deleted in this branch.
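
To make the DTO extension above concrete, here is a minimal sketch of what a [T,C] per-timestep posterior container could look like. The type `SequencePosteriors` and its methods are hypothetical names for illustration only; they are not the embedanything API, and the per-timestep argmax shown is a plain greedy read-out, not the recodebeam decoder the plans describe.

```rust
/// Hypothetical container for per-timestep posteriors of shape [T, C],
/// stored row-major (one row of C class scores per timestep).
/// Illustrative only; not the embedanything DTO.
#[derive(Debug, Clone, PartialEq)]
pub struct SequencePosteriors {
    timesteps: usize,
    classes: usize,
    data: Vec<f32>,
}

impl SequencePosteriors {
    /// Builds the container, rejecting buffers whose length is not T * C.
    pub fn new(timesteps: usize, classes: usize, data: Vec<f32>) -> Option<Self> {
        if classes == 0 || data.len() != timesteps * classes {
            return None;
        }
        Some(Self { timesteps, classes, data })
    }

    /// Returns (T, C).
    pub fn shape(&self) -> (usize, usize) {
        (self.timesteps, self.classes)
    }

    /// Returns the C scores for timestep `t`, or None if `t` is out of range.
    pub fn row(&self, t: usize) -> Option<&[f32]> {
        if t >= self.timesteps {
            return None;
        }
        let start = t * self.classes;
        Some(&self.data[start..start + self.classes])
    }

    /// Greedy read-out: index of the highest score at each timestep.
    /// Ties resolve to the lowest class index.
    pub fn argmax_per_timestep(&self) -> Vec<usize> {
        self.data
            .chunks_exact(self.classes)
            .map(|row| {
                let mut best = 0;
                for (i, &p) in row.iter().enumerate() {
                    if p > row[best] {
                        best = i;
                    }
                }
                best
            })
            .collect()
    }
}
```

A decoder working over hosted posteriors would consume rows of this kind one timestep at a time. Keeping the buffer flat and row-major is one common choice that avoids a nested allocation per timestep.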

Summary by CodeRabbit

  • Documentation
    • Added architectural planning documents defining OCR system integration strategies, transcoding approaches, layout processing pipelines, and model hosting designs to establish foundational technical direction for OCR capabilities.

coderabbitai Bot commented Jun 15, 2026

Walkthrough

Adds seven design-only markdown documents to .claude/plans/: a revised master v2 transcode plan and six sub-plans covering traineddata GGUF hosting, recodebeam decoder transcode, layout (textord/ccstruct) transcode, AST-DLL C++→Rust codegen harness, OCR→canonical SoA NodeRow integration with repair pipeline, and a SoA centroid attention field synthesis spec.

Changes

OCR Integration Architecture Design Plans

| Layer / File(s) | Summary |
|---|---|
| Master transcode plan v2: .claude/plans/tesseract-rs-transcode-master-v1.md | Versions the master plan to v2 (design locked), maps subsystems to AST-DLL codegen vs hand-ported vs raw-pointer unsafe Rust, introduces the additive DTO sequence-output variant for per-timestep posteriors, enumerates D-OCR deliverable IDs with critical-path ordering, and updates success criteria to byte-identical oracle parity on crops and full-page layouts. |
| Traineddata GGUF hosting and recodebeam decoder: .claude/plans/tesseract-rs-traineddata-gguf-v1.md, .claude/plans/tesseract-rs-recodebeam-transcode-v1.md | Traineddata plan specifies parsing .traineddata components, exporting to GGUF via bgz_tensor with preserved int8 scaling, and running inference via embedanything::infer_sequence. Recodebeam plan scopes a decoder-only transcode sourcing [T, n_classes] posteriors from D-OCR-16, defines DAWG-constrained beam behavior, and sets byte-identical decoder output as the acceptance criterion. |
| Layout transcode and AST-DLL codegen harness: .claude/plans/tesseract-rs-layout-transcode-v1.md, .claude/plans/tesseract-rs-ast-dll-codegen-v1.md | Layout plan mandates raw-pointer/intrusive-node Rust for byte-for-byte textord/ccstruct reproduction with limited hand-ported Leptonica ops. AST-DLL plan defines Clang AST stable IR extraction, ruff-disciplined Rust emission, dto_check-style structural invariants, and a behavioral diff-gate against the libtesseract FFI oracle. |
| OCR→canonical SoA NodeRow emission and repair: .claude/plans/ocr-canonical-soa-integration-v1.md | Defines NodeGuid byte layout and EdgeBlock topology for OCR tokens, specifies ValueSchema::Ocr as a FieldMask over existing tenants, lays out the character-confusion → DeepNSM → Helix/CAM-PQ/CAKES repair pipeline with field writeback, documents NodeRowPacketSoaEnvelope→Lance persistence with surreal_container job control, specifies a bit-reproducibility golden-file harness, and enumerates deliverables D-OCR-50/51/52/53 with open decisions. |
| SoA centroid attention field synthesis: .claude/plans/soa-centroid-attention-field-synthesis-v1.md | Defines a single centroid attention field (48-bit helix residue + Morton-tile perturbation shader from a φ-spiral template), enumerates multi-scale field reads, describes Phase-2 plasticity, wires ONNX recognition at the query boundary, and enforces a frozen vs plastic determinism split with OD-A/B/C open items. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • AdaWorldAPI/lance-graph#496: The OCR→canonical SoA integration plan's ValueSchema::Ocr FieldMask and ValueTenant model depends directly on the NodeRow::value slab substrate introduced in that PR.

Poem

🐇 Hop, hop — the plans are laid,
In markdown fields where designs are made.
From GGUF weights to NodeRow bytes,
The master plan v2 ignites.
No code yet shipped, but the rabbit knows —
A thorough spec is how great software grows! 📜✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: introducing a v2 corrected transcode plan for Tesseract-to-Rust with LSTM hosted via embedanything. It is concise, clear, and directly reflects the primary focus of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ddbd9775e

Comment thread .claude/plans/ocr-canonical-soa-integration-v1.md
Comment thread .claude/plans/tesseract-rs-transcode-master-v1.md Outdated
Comment thread .claude/plans/tesseract-rs-neural-layout-ocrs-v1.md Outdated

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (9)
.claude/plans/ocr-canonical-soa-integration-v1.md (2)

79-79: 💤 Low value

Clarify CHAODA reference.

Line 79 mentions "CHAODA flags anomalous tokens" without introducing what CHAODA is. If it's an external algorithm or crate, add a brief note or parenthetical definition for readers unfamiliar with it (e.g., "(Clustered Hierarchical Outlier Detection via Aggregation — see crate X)").

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/plans/ocr-canonical-soa-integration-v1.md at line 79, The reference
to CHAODA on line 79 lacks context for unfamiliar readers. Add a brief
clarification by including a parenthetical note or expanded definition
immediately after the first mention of CHAODA that explains what it stands for
(e.g., "Clustered Hierarchical Outlier Detection via Aggregation") and
optionally references the relevant crate or documentation where it is defined,
so readers understand the purpose and origin of this algorithm.

57-65: ⚖️ Poor tradeoff

Validate ValueSchema preset selection.

The plan states "define ValueSchema::Ocr (or select Cognitive if its mask already covers the above)." Looking at the provided context, Cognitive includes Meta/Qualia/Fingerprint/Energy/Plasticity/EntityType but not TurbovecResidue or HelixResidue. The OCR use table (lines 48-55) lists both TurbovecResidue and HelixResidue as OCR tenants. This means either:

  1. Define a new ValueSchema::Ocr variant that includes all required tenants, or
  2. Adjust the OCR tenant list to match an existing schema.

The current tentative wording ("or select") is good design hygiene for a deferred decision, but the D-OCR-51 acceptance should clarify which path was chosen (new schema vs existing), since the choice affects the FieldMask declaration and the value-slab carve.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/plans/ocr-canonical-soa-integration-v1.md around lines 57 - 65,
Verify and clarify the ValueSchema selection for OCR tenants in the plan. Check
that the chosen ValueSchema variant (either the new ValueSchema::Ocr or an
existing preset like Cognitive) includes all OCR-required tenants listed in the
OCR use table (TurbovecResidue and HelixResidue). Update the plan text to
explicitly state which path was selected—either define a new ValueSchema::Ocr
variant that encompasses all required tenants or adjust the OCR tenant list to
match an existing schema. Ensure this decision is documented in D-OCR-51
acceptance criteria so the FieldMask declaration and value-slab carve
implementation are unambiguous.
.claude/plans/tesseract-rs-traineddata-ndarray-v1.md (1)

37-46: 💤 Low value

Add language specifier to directory-tree code block.

Line 37 opens a code fence without a language tag. Add ```text or ```tree for clarity.

🔧 Proposed fix
- ```
+ ```text
 traineddata/
   container.rs     // TessdataManager: offset table parse → component byte slices  (CODEGEN: D-OCR-40)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/plans/tesseract-rs-traineddata-ndarray-v1.md around lines 37 - 46,
The opening code fence at line 37 in the markdown file is missing a language
specifier, making it unclear what format the code block represents. Change the
opening backticks from ``` to ```text to properly identify the directory tree
structure and improve readability in the rendered markdown.

Source: Linters/SAST tools

.claude/plans/tesseract-rs-transcode-master-v1.md (2)

3-19: 💤 Low value

Fix blockquote formatting: remove extra spaces after >.

Lines 3-19 (the metadata header) have multiple spaces after the blockquote symbol >, violating MD027. Single space is standard Markdown.

🔧 Proposed fix: normalize blockquote spacing
- > **Type:** plan family root (forward marker / co-architecture). Plants the
+ > **Type:** plan family root (forward marker / co-architecture). Plants the
-   sub-plans; owns the deliverable index, the dependency DAG, and the
+   sub-plans; owns the deliverable index, the dependency DAG, and the
-   skip-list rationale.
+   skip-list rationale.
- > **Status:** PLANTED 2026-06-15 — design only, no code. Layout/contracts proposed
+ > **Status:** PLANTED 2026-06-15 — design only, no code. Layout/contracts proposed
-   against the post-#496 front.
+   against the post-#496 front.

(Apply similarly to lines 7, 9-11, 13-14, 15-17, 19.)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/plans/tesseract-rs-transcode-master-v1.md around lines 3 - 19, The
metadata header block in the file uses blockquote formatting with extra spaces
after the `>` symbol, violating MD027 formatting standards. Fix this by
normalizing all blockquote lines throughout the header section to use exactly
one space after the `>` symbol instead of multiple spaces. This applies to all
blockquote lines in the metadata header, including the Type, Status, Front,
Canon anchors, and Skip-by-rule sections.

Source: Linters/SAST tools


41-54: 💤 Low value

Add language specifier to diagram code block.

Line 41 opens a fenced code block without a language tag. Since this is a textual flowchart/diagram, add ```text or ```diagram for consistency and future syntax-highlighting.

🔧 Proposed fix
- ```
+ ```text
 PDF / image
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/plans/tesseract-rs-transcode-master-v1.md around lines 41 - 54, The
fenced code block beginning with "PDF / image" is missing a language specifier
on the opening fence. Add a language tag (text or diagram) to the opening
backticks of this code block to enable proper syntax highlighting and maintain
consistency with markdown best practices. Change the opening ``` to ```text or
```diagram.

Source: Linters/SAST tools

.claude/plans/tesseract-rs-ast-dll-codegen-v1.md (2)

30-33: 💤 Low value

Add language specifier to flow diagram code block.

Line 30 opens a fenced code block (the C++ source → ... → formatted .rs flow) without a language tag. Add ```text for consistency.

🔧 Proposed fix
- ```
+ ```text
 C++ source ──(libclang)──► Clang AST ──► [AST DLL: stable IR dump] ──► RustAst builder
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/plans/tesseract-rs-ast-dll-codegen-v1.md around lines 30 - 33, The
opening fence for the flow diagram code block (starting with the C++ source →
Clang AST → ... flow) is missing a language specifier. Change the opening ``` to
```text to properly tag the code block language for consistency with markdown
formatting standards.

Source: Linters/SAST tools


19-28: ⚖️ Poor tradeoff

Clarify ruff codegen adaptation scope.

The plan states we reuse ruff's patterns (codegen/formatter/dto_check) to emit Rust (not Python). This is honest, but §5 ("Module assignment") does not explicitly address the effort of adapting ruff_python_codegen / ruff_formatter to emit Rust source instead of Python. A brief note on whether those crates are language-agnostic or require shims would help scope the D-OCR-41 work.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/plans/tesseract-rs-ast-dll-codegen-v1.md around lines 19 - 28, Add a
brief note in section 5 ("Module assignment") that explicitly scopes the
adaptation effort for reusing ruff's codegen and formatter crates. Clarify
whether ruff_python_codegen and ruff_formatter are language-agnostic enough to
emit Rust source with minimal changes, or whether they require shims/wrapper
layers to decouple them from Python-specific logic. This note should directly
address the D-OCR-41 work scope and help readers understand the actual effort
required to adapt these crates from Python emission to Rust emission.
.claude/plans/tesseract-rs-neural-layout-ocrs-v1.md (2)

35-38: 💤 Low value

Add language specifier to pipeline diagram code block.

Line 35 opens a fenced code block (the preprocess → ocrs::detection → ... → tokens+confidence flow) without a language tag. Add ```text for clarity.

🔧 Proposed fix
- ```
+ ```text
 preprocess (image/imageproc) ─► ocrs::detection ─► ocrs::layout_analysis (reading order)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/plans/tesseract-rs-neural-layout-ocrs-v1.md around lines 35 - 38,
The fenced code block containing the pipeline diagram (starting with preprocess
and flowing through ocrs::detection, ocrs::layout_analysis, etc.) is missing a
language specifier. Change the opening triple backticks from ``` to ```text to
explicitly declare the code block language type for proper markdown rendering
and clarity.

Source: Linters/SAST tools


40-42: ⚖️ Poor tradeoff

Clarify rten model asset availability.

The plan says "Confirm the converter + current model assets are present in the fork before relying on them (D-OCR-30 acceptance)." This is a prudent gating criterion, but the acceptance gate should explicitly state: (1) which ONNX models are the source (detection + recognition), (2) whether rten-convert is confirmed to work on them, and (3) whether .rten blobs are vendored in the fork or fetched/converted at build time. Currently deferred to acceptance; recommend documenting the exact check-list in a follow-up.

Would you like me to open a follow-up ticket to detail the rten asset inventory and conversion steps as part of D-OCR-30 acceptance?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/plans/tesseract-rs-neural-layout-ocrs-v1.md around lines 40 - 42,
The D-OCR-30 acceptance gate described in the Models section is too vague about
what must be confirmed before relying on the rten conversion approach. Expand
this acceptance criterion to explicitly document three specific checks: (1)
identify which ONNX models serve as the source for detection and recognition,
(2) confirm that rten-convert successfully processes those models, and (3)
clarify whether the resulting .rten blobs will be vendored directly in the fork
or fetched and converted at build time. This ensures the acceptance gate
provides a concrete checklist rather than leaving the verification steps
ambiguous.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7f730765-e8d2-491b-bd1f-8414c331fab8

📥 Commits

Reviewing files that changed from the base of the PR and between 2e58e03 and 6ddbd97.

📒 Files selected for processing (6)
  • .claude/plans/ocr-canonical-soa-integration-v1.md
  • .claude/plans/tesseract-rs-ast-dll-codegen-v1.md
  • .claude/plans/tesseract-rs-lstm-recodebeam-v1.md
  • .claude/plans/tesseract-rs-neural-layout-ocrs-v1.md
  • .claude/plans/tesseract-rs-traineddata-ndarray-v1.md
  • .claude/plans/tesseract-rs-transcode-master-v1.md

@AdaWorldAPI changed the title from "docs(plan): Tesseract → tesseract-rs transcode plan family (OCR → canonical SoA)" to "docs(plan): Tesseract → tesseract-rs 1:1 transcode (LSTM hosted via embedanything) — v2" on Jun 15, 2026
…CR-53; pure-Rust front-end — addresses Codex
@AdaWorldAPI

Owner Author

Addressed the Codex findings:

  • P1 (text preservation): removed the Fingerprint/EntityType hash mapping — that was wrong. An OCR token is the terminal of the perturbation cascade, not a stored/hashed string. Text reconstructs as codebook_index(Meta) + residue(HelixResidue 48b ⊕ TurbovecResidue PQ) decoded via the DeepNSM Morton-tile stacked-pyramid shader cascade → CAKES nearest-valid-token over the codebook. True-OOV falls back to the recoder-code residue (recodebeam emits codes, not pixels). Reversible without a hash or a string column.
  • P2 (D-OCR-53 ordering): D-OCR-53 now depends on D-OCR-50,51 (row layout defined before byte golden-diff).
  • P2 (PDFium/zero-C): carried into the layout plan — front-end must be pure-Rust (ferrules/image), not pdfium-render.
  • Defined ValueSchema::Ocr explicitly (Cognitive lacks Helix/Turbovec); glossed CHAODA.

Owner Author

Taking ownership of this PR. All three Codex findings are dispositioned against the current branch (298e8e93):

  • P1 — text preservation (ocr-canonical-soa-integration-v1.md): an OCR token reconstructs as codebook_index(Meta) + residue (helix 24-bit golden-index ⊕ TurbovecResidue PQ) → CAKES nearest-valid-token, recoder-code fallback for true-OOV. No Fingerprint hash, no string column; ValueSchema::Ocr defined. Resolved.
  • P2 — D-OCR-53 ordering (tesseract-rs-transcode-master-v1.md): the golden-diff harness now depends on D-OCR-21,30,50,51 (row layout defined before the byte diff). Resolved.
  • P2 — PDFium vs zero-C: the neural-layout-ocrs plan was deleted and the front-end is pure-Rust (ferrules/image), so the zero-C acceptance gate holds. Resolved (outdated).

Title note: "embedanything" is the deliberate LSTM-hosting layer (.traineddata → GGUF → embedanything DTO/candle → ndarray AMX, per the body + .grok/NDARRAY_BGZ_EMBEDANYTHING_INTEGRATION.md) — kept as accurate. CodeRabbit's title check only scans the six changed plan files, not the runbook, so that warning is a false positive.

CodeRabbit's initial pass was rate-limited and never ran — re-triggering now.

@coderabbitai review


Generated by Claude Code

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

@AdaWorldAPI: Acknowledged — all three dispositions noted:

  • P1 (text preservation): Fingerprint/hash mapping replaced with codebook_index(Meta) + residue(HelixResidue 48b ⊕ TurbovecResidue PQ) → CAKES nearest-valid-token + recoder-code OOV fallback; ValueSchema::Ocr defined explicitly. ✅
  • P2 (D-OCR-53 ordering): Golden-diff harness now correctly depends on D-OCR-21,30,50,51 (row layout before byte diff). ✅
  • P2 (PDFium/zero-C): Neural-layout plan replaced; front-end is pure-Rust (ferrules/image). ✅
  • Title ("embedanything"): Understood — it's the deliberate LSTM-hosting layer documented in the runbook (.grok/NDARRAY_BGZ_EMBEDANYTHING_INTEGRATION.md), not reflected in the six changed plan files alone. Noted as intentional. ✅

Triggering a fresh review now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/plans/tesseract-rs-transcode-master-v1.md:
- Around line 74-76: The critical path on line 76 currently shows "40 → {10,30}
→ 16 → 21 → 42 → 53" but D-OCR-53's dependency list on line 74 includes D-OCR-50
and D-OCR-51 which are not represented in this path. Resolve this inconsistency
by either updating the critical path to explicitly include nodes D-OCR-50 and
D-OCR-51 in the sequence leading to D-OCR-53 (ensuring they appear between node
42 and 53), or remove D-OCR-50 and D-OCR-51 from the dependency list on line 74
and add a clear explanation of why these dependencies are not required for the
critical path.
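
The inconsistency flagged here can be checked mechanically. The sketch below flattens the critical path into a sequence of D-OCR numbers and reports which dependencies of a target node never appear before it. The function name `missing_before` is hypothetical, and flattening `{10,30}` into two consecutive entries is a simplifying assumption, not how the plan file represents the DAG.

```rust
use std::collections::HashSet;

/// Returns the entries of `deps` that do not appear in `path` before `target`.
/// If `target` is absent from `path`, the whole path counts as "before".
/// Hypothetical helper for illustration; not part of the plan tooling.
pub fn missing_before(path: &[u32], target: u32, deps: &[u32]) -> Vec<u32> {
    // Everything strictly ahead of the target counts as already scheduled.
    let cutoff = path.iter().position(|&n| n == target).unwrap_or(path.len());
    let seen: HashSet<u32> = path[..cutoff].iter().copied().collect();
    // Keep the original dependency order so the report is stable.
    deps.iter().copied().filter(|d| !seen.contains(d)).collect()
}
```

With the path `40 → {10,30} → 16 → 21 → 42 → 53` flattened to `[40, 10, 30, 16, 21, 42, 53]` and the dependency list `[21, 30, 50, 51]` for D-OCR-53, the helper reports 50 and 51 as missing. Inserting both between 42 and 53 makes the report empty.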
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 04fb155f-267b-46a1-85bb-5ef10b509f50

📥 Commits

Reviewing files that changed from the base of the PR and between 6ddbd97 and 298e8e9.

📒 Files selected for processing (7)
  • .claude/plans/ocr-canonical-soa-integration-v1.md
  • .claude/plans/soa-centroid-attention-field-synthesis-v1.md
  • .claude/plans/tesseract-rs-ast-dll-codegen-v1.md
  • .claude/plans/tesseract-rs-layout-transcode-v1.md
  • .claude/plans/tesseract-rs-recodebeam-transcode-v1.md
  • .claude/plans/tesseract-rs-traineddata-gguf-v1.md
  • .claude/plans/tesseract-rs-transcode-master-v1.md
✅ Files skipped from review due to trivial changes (5)
  • .claude/plans/tesseract-rs-layout-transcode-v1.md
  • .claude/plans/tesseract-rs-recodebeam-transcode-v1.md
  • .claude/plans/tesseract-rs-traineddata-gguf-v1.md
  • .claude/plans/soa-centroid-attention-field-synthesis-v1.md
  • .claude/plans/tesseract-rs-ast-dll-codegen-v1.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • .claude/plans/ocr-canonical-soa-integration-v1.md

Comment thread .claude/plans/tesseract-rs-transcode-master-v1.md
@AdaWorldAPI AdaWorldAPI merged commit cfcd4af into main Jun 16, 2026
1 check passed
AdaWorldAPI pushed a commit that referenced this pull request Jun 16, 2026
…5-specialist framing)

Five specialists (cascade / family-codec / palette / dto-soa / truth-architect)
framed the merged #497 OCR-transcode plans against the post-#498 substrate. Two
showstoppers + 6-way drift; all 7 plans corrected:

- HelixResidue 48 B → 6 B everywhere (a stored Signed360 index, not a 48-byte field);
  budgets/carve rebaselined (Full 112, [32,144)); headers #496#498.
- "Morton-tile stacked-pyramid perturbation-shader" purged (does not exist; Morton
  rejected for Hilbert) → real primitives (mipmap pyramid / HHTL depth-cascade / CAKES).
- "reversible without a hash" reframed: no residue→rank inverse exists; node =
  identity → content-store lookup, codebook = repair signal (I-VSA-IDENTITIES).
- §0 tripwires: no ValueSchema::Ocr variant (ride Full/Compressed); Meta de-overloaded
  (confidence→Energy, provenance→Plasticity, OOV→content-store); TurbovecResidue is the
  edge codec, glyph→word uses DeepNSM CamCodes.
- master critical path 42→53 becomes 42→{50,51}→53 (resolves the open #497 CodeRabbit Major).

New ocr-probes-v1.md specs the 4 gating probes (OCR-RT/DET/POST/SCHEMA) for the
unmeasured claims (int8-exact LSTM, bit-reproducible diff, 200k-LOC 1:1 layout).
OCR-SCHEMA shipped as a contract test proving OCR rides an existing preset.
EPIPHANIES E-OCR-PLAN-DRIFT-1 + AGENT_LOG entry.

contract lib green; fmt clean.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
AdaWorldAPI added a commit that referenced this pull request Jun 16, 2026
docs(plans)+test: rebaseline #497 OCR plans to #498 + gating probes (5-specialist framing)
AdaWorldAPI pushed a commit that referenced this pull request Jun 16, 2026
…oaOwner cherry-pick + LanceVersionScheduler + SurrealMailboxView (D-PG-6)

Lands four tasks from the shortest-unblocking-path list surfaced after
PR #497-#501 + the AdaWorldAPI/surrealdb fork bump (lance/lance-index =7.0.0,
lancedb =0.30.0, ndarray exact-rev). All four meet at the single contract
trait `MailboxSoaView`, closing the cascade in one commit (E-UNBLOCK-CASCADE-1).

## Task 3 — `NiblePath::{from_guid_prefix, prefix}` (zero-dep, foundational)

Ontology-side keystone follow-up of PR #498's `classid → ReadMode` LE contract.
The 20-nibble `classid(8) | HEEL(4) | HIP(4) | TWIG(4)` prefix overflows the
16-nibble MAX_DEPTH: the deterministic fold drops the canon-reserved high u16
of classid (root-first pack: `classid_lo(4) | HEEL(4) | HIP(4) | TWIG(4)`),
returning None when the fold would be lossy. `prefix(d)` is the O(1) ancestor
view; `prefix(d).is_ancestor_of(self)` holds for every d ≤ self.depth (the
routing-cache view of a deeper class path).

  +7 tests in `hhtl::tests`; contract lib 619 → 632 green.
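
A minimal sketch of the fold described above, not the crate's actual code. The type name `NiblePathSketch`, its field layout, and the reading that "lossy" means "the high u16 of classid is non-zero" are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the path type; field layout is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NiblePathSketch {
    /// Nibbles packed root-first: the root nibble sits in the highest 4 bits.
    bits: u64,
    /// Number of valid nibbles, 0..=MAX_DEPTH.
    depth: u8,
}

pub const MAX_DEPTH: u8 = 16;

impl NiblePathSketch {
    /// Fold classid(8) | HEEL(4) | HIP(4) | TWIG(4) (20 nibbles) into 16.
    /// Assumption: the fold is lossy exactly when the high u16 of `classid`
    /// is non-zero, in which case None is returned.
    pub fn from_guid_prefix(classid: u32, heel: u16, hip: u16, twig: u16) -> Option<Self> {
        if classid >> 16 != 0 {
            return None;
        }
        // Root-first pack: classid_lo(4) | HEEL(4) | HIP(4) | TWIG(4).
        let bits = ((classid as u64 & 0xFFFF) << 48)
            | ((heel as u64) << 32)
            | ((hip as u64) << 16)
            | (twig as u64);
        Some(Self { bits, depth: MAX_DEPTH })
    }

    /// O(1) ancestor view: keep the first `d` nibbles (clamped to `depth`).
    pub fn prefix(&self, d: u8) -> Self {
        let d = d.min(self.depth);
        Self { bits: self.bits & Self::mask(d), depth: d }
    }

    /// True when `self` is a (non-strict) prefix of `other`.
    pub fn is_ancestor_of(&self, other: &Self) -> bool {
        self.depth <= other.depth && (other.bits & Self::mask(self.depth)) == self.bits
    }

    /// Mask selecting the top `d` nibbles; `d == 0` is special-cased to
    /// avoid a shift by 64.
    fn mask(d: u8) -> u64 {
        if d == 0 { 0 } else { u64::MAX << (64 - 4 * d as u32) }
    }
}
```

Under these assumptions, `p.prefix(d).is_ancestor_of(&p)` holds for every `d` up to `p.depth`, which is the invariant the commit message states.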

## Task 2 — `impl MailboxSoaView + MailboxSoaOwner for MailboxSoA<N>`

Cherry-pick of jolly-cori-clnf9 commit 463d71b (integrated-cognitive-planner-v1
§2 Seam #3, +149 LOC). Adds `pub phase: KanbanColumn` field + zero-copy
repr(transparent) slice impls (edges_raw, meta_raw) + the in-RAM Rubicon
driving-loop test (`test_in_ram_driving_loop_walks_rubicon_to_commit`). The
contract spine (#437/#439) now drives an actual loop end-to-end — no surreal,
no ractor bus needed for the in-process case.

  +1 driving-loop test; cognitive-shader-driver lib 85 → 86 green.
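
For readers unfamiliar with the zero-copy `repr(transparent)` slice pattern mentioned above, a minimal sketch follows. `EdgeWord` and the `u64` payload are hypothetical; the real element types behind `edges_raw` / `meta_raw` are not shown in this commit message.

```rust
/// Hypothetical newtype over a raw edge word (the real type is not shown here).
#[repr(transparent)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EdgeWord(pub u64);

/// Reinterpret a slice of newtypes as a slice of their raw payloads
/// without copying.
pub fn edges_raw(edges: &[EdgeWord]) -> &[u64] {
    // SAFETY: `repr(transparent)` guarantees `EdgeWord` has the same size,
    // alignment and layout as `u64`, and the returned slice borrows from
    // `edges`, so it cannot outlive the original storage.
    unsafe { core::slice::from_raw_parts(edges.as_ptr().cast::<u64>(), edges.len()) }
}
```

The returned slice points at the same memory as the input, so no allocation or copy happens on the read path.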

## Task 1 — `LanceVersionScheduler` over `VersionedGraph::versions()`

D-MBX-9-IN core impl (the CI-gated twin of the contract slice shipped 2026-05-31).
Lives in `crates/lance-graph/src/graph/scheduler.rs`. Wraps a `VersionedGraph` +
inner `VersionScheduler<S = NextPhaseScheduler>` and exposes:

- `drive_once(view, exec)`           — read current Lance version, lower to a move
- `drive_at_latest(view, exec)`      — fold `versions().last()` into a move
- `current_dataset_version()`        — typed `DatasetVersion` over nodes head

Closes `E-SUBSTRATE-IS-THE-SCHEDULER`'s OUT-direction end-to-end. The OUT
direction stays propose-not-dispose (R1): returned `KanbanMove` is for the
caller's `MailboxSoaOwner::try_advance_phase` to apply.

  +5 tests with real on-disk tempdir Lance (no mocks).
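
The propose-not-dispose split (R1) can be illustrated with a small sketch. Everything below is hypothetical: the `KanbanColumn` variants, the trait method shapes, the `SchedulerSketch` type and its "propose only when the version advanced" rule are assumptions for illustration, not the signatures of `LanceVersionScheduler`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KanbanColumn { Backlog, Doing, Commit } // hypothetical variants

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct KanbanMove { pub from: KanbanColumn, pub to: KanbanColumn }

/// Read-only surface: enough to propose a move, not to apply one.
pub trait MailboxSoaView {
    fn phase(&self) -> KanbanColumn;
}

/// Mutating surface: only the owner can apply a proposed move.
pub trait MailboxSoaOwner: MailboxSoaView {
    fn try_advance_phase(&mut self, mv: KanbanMove) -> Result<(), &'static str>;
}

pub struct Mailbox { phase: KanbanColumn }

impl MailboxSoaView for Mailbox {
    fn phase(&self) -> KanbanColumn { self.phase }
}

impl MailboxSoaOwner for Mailbox {
    fn try_advance_phase(&mut self, mv: KanbanMove) -> Result<(), &'static str> {
        if mv.from != self.phase {
            return Err("stale move");
        }
        self.phase = mv.to;
        Ok(())
    }
}

/// Hypothetical scheduler: proposes a move when the dataset version advanced.
#[derive(Default)]
pub struct SchedulerSketch { last_seen: u64 }

impl SchedulerSketch {
    /// Takes only `&impl MailboxSoaView`, so it cannot mutate the mailbox.
    pub fn propose(&mut self, view: &impl MailboxSoaView, version: u64) -> Option<KanbanMove> {
        if version <= self.last_seen {
            return None;
        }
        self.last_seen = version;
        let from = view.phase();
        let to = match from {
            KanbanColumn::Backlog => KanbanColumn::Doing,
            KanbanColumn::Doing => KanbanColumn::Commit,
            KanbanColumn::Commit => return None,
        };
        Some(KanbanMove { from, to })
    }
}
```

Because the scheduler side only ever sees the view trait, applying the returned `KanbanMove` remains the caller's decision.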

## Task 4 — `SurrealMailboxView<'a>` (D-PG-6 contract slice)

Read-only `MailboxSoaView` adapter the SurrealQL projection populates via
`from_columns(...)` — pure zero-copy borrow over the kv-lance scan's byte
buffers. Imports `MailboxSoaView` but NOT `MailboxSoaOwner` (compile-time
enforcement of `kanban.rs:1-21` "surreal=project-read-only, callcenter=commit").

`read_via_kv_lance()` returns the new typed `SurrealContainerError::BlockedColdBuild`
until the surrealdb fork dep in `Cargo.toml` is uncommented — kept off by default
to avoid the ~10 min cold surrealdb build for contributors who don't need it.
The contract surface is available today; the integration is one Cargo.toml edit
+ a SurrealQL projection body in `view.rs`.

  +4 tests; new `lance-graph-contract` dep in surreal_container/Cargo.toml;
  BLOCKED(C) marker flipped to RESOLVED.
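
A rough sketch of the read-only adapter shape described above. The trait methods, the column element types and the `SurrealMailboxViewSketch` name are assumptions; only `from_columns`, `read_via_kv_lance` and `SurrealContainerError::BlockedColdBuild` are names taken from this commit message, and their real signatures may differ.

```rust
/// Read-only surface (method set is hypothetical).
pub trait MailboxSoaView {
    fn edges_raw(&self) -> &[u64];
    fn meta_raw(&self) -> &[u8];
}

/// Borrowing adapter: holds slices into buffers owned by the scan.
#[derive(Debug)]
pub struct SurrealMailboxViewSketch<'a> {
    edges: &'a [u64],
    meta: &'a [u8],
}

impl<'a> SurrealMailboxViewSketch<'a> {
    /// Zero-copy construction: only the slice references are stored.
    pub fn from_columns(edges: &'a [u64], meta: &'a [u8]) -> Self {
        Self { edges, meta }
    }
}

// Only the view trait is implemented. With no owner impl in scope, any
// attempt to call a mutating owner method on this type fails to compile.
impl<'a> MailboxSoaView for SurrealMailboxViewSketch<'a> {
    fn edges_raw(&self) -> &[u64] { self.edges }
    fn meta_raw(&self) -> &[u8] { self.meta }
}

#[derive(Debug, PartialEq, Eq)]
pub enum SurrealContainerError {
    /// The optional dependency is not enabled in this build.
    BlockedColdBuild,
}

/// Placeholder that reports the blocked state until the dependency is enabled.
pub fn read_via_kv_lance() -> Result<SurrealMailboxViewSketch<'static>, SurrealContainerError> {
    Err(SurrealContainerError::BlockedColdBuild)
}
```

Returning a typed error rather than panicking lets callers compile and test against the contract surface while the heavy dependency stays off.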

## What this unblocks

- **D-MBX-9-IN-impl** — SHIPPED (the contract trait now has a Lance-backed implementor).
- **D-MBX-A6-P3** — still queued, BUT Seam #3 (the in-RAM loop) is now in-tree;
  a downstream session can wire the emit-side without depending on the unmerged
  jolly branch.
- **D-PG-6 (Rubicon kanban VIEW)** — contract slice SHIPPED; impl-side gated on
  `BlockedColdBuild` flip-on (one Cargo.toml uncomment + projection body).
- **Identity-architecture v1 §3 P-SCOPE-CLASSIFY** — solved (the bijection-width
  fix is deterministic + ancestor-preserving + falsifiable by tests).

## Tests + clippy

- lance-graph-contract:   **632** (+7 hhtl)
- cognitive-shader-driver: **86** (+1 driving-loop)
- lance-graph::scheduler:  **5** (new module, real Lance tempdir)
- surreal_container::view: **4** (new module)

All clippy `-D warnings` clean on the new files. Pre-existing lints in
lance-graph-ontology / lance-graph-planner / ndarray_bridge.rs are out of
session scope.

## Board hygiene (mandatory rule)

- LATEST_STATE.md — Contract Inventory PREPEND for the new types.
- EPIPHANIES.md — E-UNBLOCK-CASCADE-1: three independent landings converge on
  one trait surface, closing four queued deliverables in one commit.
- AGENT_LOG.md — task-by-task summary with test counts.

https://claude.ai/code/session_01Xzyc27Nx3f8WC5KzwfWfjx