feat: add extractFormulasInRegions for native formula text extraction #40

Open
abimaelmartell wants to merge 1 commit into main from abimaelmartell/formula-extraction

@abimaelmartell

Summary

  • Add extractFormulasInRegions NAPI export (and extract_formulas_in_regions_mem Rust API) for extracting text from formula bounding-box regions with formula-specific quality validation
  • Formula text is legitimately symbol-heavy (Greek letters, math operators, subscripts), so the generic is_garbage_text check (requires >50% alphanumeric) would false-positive on valid formula regions. The new endpoint skips that check and instead uses is_formula_garbage which catches actual decode failures: PUA characters from undecoded TeX extensible delimiters (>10%) and control characters from broken font encodings (>30%)
  • Refactor shared page-extraction boilerplate into prepare_region_extraction, eliminating ~40 lines of duplication between extract_text_in_regions_mem and extract_tables_in_regions_mem
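
The refactor in the last bullet can be pictured with a small sketch. This is not the PR's code: the `Region` and `PreparedPage` types, the signature of `prepare_region_extraction`, and the `extract_with_validator` helper are all hypothetical stand-ins, assuming the shared step amounts to validating page indices and grouping regions by page before any text is read. The point is only that each endpoint can reuse one preparation step and differ in the quality check it plugs in.

```rust
use std::collections::BTreeMap;

// Hypothetical region type: a page index plus a bounding box.
#[derive(Debug, Clone, PartialEq)]
pub struct Region {
    pub page: usize,
    pub bbox: [f32; 4],
}

// Hypothetical result of the shared preparation step.
pub struct PreparedPage<'a> {
    pub page_index: usize,
    pub regions: Vec<&'a Region>,
}

// Assumed shape of the shared helper: reject out-of-range pages and
// group the requested regions by page, in ascending page order.
pub fn prepare_region_extraction(
    page_count: usize,
    regions: &[Region],
) -> Result<Vec<PreparedPage<'_>>, String> {
    let mut by_page: BTreeMap<usize, Vec<&Region>> = BTreeMap::new();
    for region in regions {
        if region.page >= page_count {
            return Err(format!(
                "region references page {} but the document has {} pages",
                region.page, page_count
            ));
        }
        by_page.entry(region.page).or_default().push(region);
    }
    Ok(by_page
        .into_iter()
        .map(|(page_index, regions)| PreparedPage { page_index, regions })
        .collect())
}

// Hypothetical caller: every endpoint shares the preparation step and
// supplies its own text reader and garbage check.
pub fn extract_with_validator(
    page_count: usize,
    regions: &[Region],
    read_text: impl Fn(&Region) -> String,
    is_garbage: impl Fn(&str) -> bool,
) -> Result<Vec<Option<String>>, String> {
    let pages = prepare_region_extraction(page_count, regions)?;
    let mut out = Vec::new();
    for page in &pages {
        for &region in &page.regions {
            let text = read_text(region);
            // Regions that fail the quality check yield None.
            out.push(if is_garbage(&text) { None } else { Some(text) });
        }
    }
    Ok(out)
}
```

In this sketch the results come back in page order rather than input order; whether the real helper preserves input order is not stated in the PR.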

Test plan

  • cargo fmt — clean
  • cargo clippy -- -D warnings — zero warnings
  • cargo test — 104 passed (includes 5 new is_formula_garbage unit tests)
  • pdf-evals — 0 regressions (0 changed, 180 unchanged)

🤖 Generated with Claude Code

Add a new region extraction endpoint that uses formula-specific quality
checks instead of the generic text garbage detector. Formula text is
legitimately symbol-heavy (Greek letters, math operators, subscripts),
so the standard is_garbage_text check — which requires >50% alphanumeric
characters — would false-positive on valid formula regions.

The new is_formula_garbage validator catches actual decode failures:
PUA characters from undecoded TeX extensible delimiters (>10%) and
control characters from broken font encodings (>30%).
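
To make the two thresholds concrete, here is a minimal sketch of both checks. It is an illustration, not the PR's implementation: the function bodies, the `is_private_use` helper, and the choice of denominator (all characters for the formula check, non-whitespace characters for the generic check) are assumptions. Only the function names and the 10%, 30%, and 50% thresholds come from the description above.

```rust
// Hypothetical helper: true for Unicode Private Use Area code points
// (the BMP block plus the two supplementary private-use planes).
fn is_private_use(c: char) -> bool {
    matches!(
        c,
        '\u{E000}'..='\u{F8FF}'
            | '\u{F0000}'..='\u{FFFFD}'
            | '\u{100000}'..='\u{10FFFD}'
    )
}

// Sketch of the formula-specific check: flag text only when it looks
// like a decode failure (PUA > 10% or control characters > 30%).
pub fn is_formula_garbage(text: &str) -> bool {
    let total = text.chars().count();
    if total == 0 {
        // Assumption: empty text is not treated as garbage here.
        return false;
    }
    let pua = text.chars().filter(|&c| is_private_use(c)).count();
    // Assumption: ordinary whitespace (tabs, newlines) is not counted
    // as a control character.
    let control = text
        .chars()
        .filter(|c| c.is_control() && !c.is_whitespace())
        .count();
    let pua_ratio = pua as f64 / total as f64;
    let control_ratio = control as f64 / total as f64;
    pua_ratio > 0.10 || control_ratio > 0.30
}

// Sketch of the generic check for contrast: text is garbage unless
// more than 50% of its non-whitespace characters are alphanumeric.
pub fn is_garbage_text(text: &str) -> bool {
    let visible: Vec<char> = text.chars().filter(|c| !c.is_whitespace()).collect();
    if visible.is_empty() {
        return false;
    }
    let alnum = visible.iter().filter(|c| c.is_alphanumeric()).count();
    (alnum as f64 / visible.len() as f64) <= 0.5
}
```

Under these assumptions, a string such as `"∑ ∫ ≤ α → ∞"` is rejected by the generic check (one alphanumeric character out of six) but accepted by the formula check, while a string in which half the characters are PUA code points is rejected by the formula check.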

Also refactors the shared page-extraction boilerplate into
prepare_region_extraction, eliminating duplication across
extract_text_in_regions_mem and extract_tables_in_regions_mem.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>