feat: add extractFormulasInRegions for native formula text extraction by abimaelmartell · Pull Request #40 · firecrawl/pdf-inspector

abimaelmartell · 2026-04-15T17:52:01Z

Summary

Add extractFormulasInRegions NAPI export (and extract_formulas_in_regions_mem Rust API) for extracting text from formula bounding-box regions with formula-specific quality validation
Formula text is legitimately symbol-heavy (Greek letters, math operators, subscripts), so the generic is_garbage_text check, which requires >50% alphanumeric characters, would wrongly flag valid formula regions as garbage. The new endpoint skips that check and instead uses is_formula_garbage, which catches actual decode failures: PUA characters from undecoded TeX extensible delimiters (>10%) and control characters from broken font encodings (>30%)
Refactor shared page-extraction boilerplate into prepare_region_extraction, eliminating ~40 lines of duplication between extract_text_in_regions_mem and extract_tables_in_regions_mem
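
For readers without the diff at hand, the two thresholds described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `_sketch` function names, the `is_private_use` helper, the choice to measure ratios over non-whitespace characters, and the handling of empty input are all assumptions. The PUA ranges are the standard Unicode Private Use Areas.

```rust
/// Hypothetical helper: true for code points in the Unicode Private Use Areas.
fn is_private_use(c: char) -> bool {
    matches!(
        c,
        '\u{E000}'..='\u{F8FF}' | '\u{F0000}'..='\u{FFFFD}' | '\u{100000}'..='\u{10FFFD}'
    )
}

/// Sketch of the generic check: garbage unless more than 50% of the
/// non-whitespace characters are alphanumeric. (Assumption: empty input
/// is not treated as garbage.)
pub fn is_garbage_text_sketch(text: &str) -> bool {
    let total = text.chars().filter(|c| !c.is_whitespace()).count();
    if total == 0 {
        return false;
    }
    let alnum = text.chars().filter(|c| c.is_alphanumeric()).count();
    alnum * 2 <= total
}

/// Sketch of the formula-specific check: garbage when more than 10% of the
/// non-whitespace characters are PUA code points, or more than 30% are
/// control characters. Symbol density alone never triggers it.
pub fn is_formula_garbage_sketch(text: &str) -> bool {
    let (mut total, mut pua, mut control) = (0usize, 0usize, 0usize);
    for c in text.chars().filter(|c| !c.is_whitespace()) {
        total += 1;
        if is_private_use(c) {
            pua += 1;
        } else if c.is_control() {
            control += 1;
        }
    }
    if total == 0 {
        return false; // assumption: nothing to judge
    }
    // Integer arithmetic avoids floating-point comparisons at the thresholds.
    pua * 10 > total || control * 10 > total * 3
}
```

Under these assumptions, a symbol-heavy string such as `(a + b) × (c − d) ≤ ∞` fails the generic check because only 4 of its 13 non-whitespace characters are alphanumeric, yet it passes the formula check because it contains no PUA or control characters.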

Test plan

cargo fmt — clean
cargo clippy -- -D warnings — zero warnings
cargo test — 104 passed (includes 5 new is_formula_garbage unit tests)
pdf-evals — 0 regressions (0 changed, 180 unchanged)

🤖 Generated with Claude Code

Add a new region extraction endpoint that uses formula-specific quality checks instead of the generic text garbage detector. Formula text is legitimately symbol-heavy (Greek letters, math operators, subscripts), so the standard is_garbage_text check, which requires >50% alphanumeric characters, would false-positive on valid formula regions. The new is_formula_garbage validator catches actual decode failures: PUA characters from undecoded TeX extensible delimiters (>10%) and control characters from broken font encodings (>30%).

Also refactors the shared page-extraction boilerplate into prepare_region_extraction, eliminating duplication across extract_text_in_regions_mem and extract_tables_in_regions_mem.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

abimaelmartell mentioned this pull request Apr 15, 2026

feat: heuristic LaTeX recovery for formula regions (0.7.5) #42

Open

4 tasks

abimaelmartell wants to merge 1 commit into main from abimaelmartell/formula-extraction
