[Optimization] ~42% LLM call reduction via hybrid heuristics (font detection + fuzzy matching)

Hi team! We've been using PageIndex on large documents (50–100+ pages) and found that, in our initial tests, a significant portion of the LLM calls made during indexing can be resolved locally without losing accuracy.

## Findings

Many LLM calls are for high-confidence decisions that local heuristics can handle:

| Decision Point | Current Behavior | Proposed Optimization |
|---|---|---|
| `find_toc_pages` | 1 LLM call per page scanned | Font/layout analysis — dot leaders, digit-ending lines, TOC keywords — resolves obvious cases locally |
| `verify_toc` | 1 LLM call per TOC entry | Fuzzy string matching pre-confirms title presence when confidence is high |
| `check_title_appearance_in_start` | 1 LLM call per node | Fuzzy match on the first ~300 chars of the target page |
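
To illustrate the `find_toc_pages` row, here is a minimal sketch of a three-way page check on already-extracted page text. It is not the code from the linked repository: the function name `looks_like_toc`, the regexes, and the `hi`/`lo` thresholds are assumptions chosen for illustration. It returns `True` or `False` only for obvious cases and `None` when the page should still go to the LLM.

```python
import re
from typing import Optional

# Illustrative patterns (assumptions, not the repository's actual rules).
DOT_LEADER = re.compile(r"\.{4,}\s*\d+\s*$")   # e.g. "Introduction ........ 3"
ENDS_WITH_DIGIT = re.compile(r"\d+\s*$")        # line ends with a page number
TOC_KEYWORDS = ("table of contents", "contents")


def looks_like_toc(page_text: str, hi: float = 0.5, lo: float = 0.1) -> Optional[bool]:
    """Return True/False for obvious pages, None when the LLM should decide.

    `page_text` is plain text for one page (for example from a PDF text
    extractor). `hi` and `lo` are hypothetical thresholds.
    """
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return None  # no text signal at all, so defer to the LLM

    dot_ratio = sum(bool(DOT_LEADER.search(ln)) for ln in lines) / len(lines)
    digit_ratio = sum(bool(ENDS_WITH_DIGIT.search(ln)) for ln in lines) / len(lines)
    has_keyword = any(kw in ln.lower() for ln in lines[:5] for kw in TOC_KEYWORDS)

    # Obvious TOC: mostly dot-leader lines, or a TOC heading plus many
    # lines that end in a page number.
    if dot_ratio >= hi or (has_keyword and digit_ratio >= hi):
        return True
    # Obvious non-TOC: no keyword, no dot leaders, almost no digit endings.
    if not has_keyword and dot_ratio == 0 and digit_ratio <= lo:
        return False
    return None  # uncertain: escalate to the LLM
```

A caller would issue the LLM call only when the result is `None`, which keeps the uncertain cases on the original path.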

**Initial test on an 86-page paper:** ~260 LLM calls → ~150 (~42% reduction) with identical verification accuracy.
We have only tested on a small set of documents so far; broader benchmarking across document types would be a useful next step.

Key constraint: heuristics only skip a call when confidence is high. Uncertain cases always escalate to the LLM. The self-verification loop is never bypassed.
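
The escalation rule can be sketched as follows for the title check. This is a hedged illustration, not the repository's code: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `rapidfuzz` so the snippet has no dependencies, and `title_score`, `title_appears`, the `ask_llm` callback, and the `accept` threshold of 0.9 are all hypothetical names and values.

```python
from difflib import SequenceMatcher
from typing import Callable


def _normalize(s: str) -> str:
    # Lowercase and collapse whitespace so layout differences do not matter.
    return " ".join(s.lower().split())


def title_score(title: str, page_text: str, window: int = 300) -> float:
    """Best similarity (0..1) between the title and any title-length
    slice of the first `window` characters of the page."""
    t = _normalize(title)
    head = _normalize(page_text[:window])
    if not t or not head:
        return 0.0
    if t in head:
        return 1.0
    n = len(t)
    best = 0.0
    for i in range(max(1, len(head) - n + 1)):
        best = max(best, SequenceMatcher(None, t, head[i:i + n]).ratio())
    return best


def title_appears(
    title: str,
    page_text: str,
    ask_llm: Callable[[str, str], bool],
    accept: float = 0.9,  # hypothetical threshold
) -> bool:
    """Confirm locally only when the match is strong; otherwise ask the LLM."""
    if title_score(title, page_text) >= accept:
        return True  # high confidence: the LLM call is skipped
    return ask_llm(title, page_text)  # anything else escalates
```

Note that the local path can only confirm a title. A weak match is never treated as a rejection; it is handed to the LLM unchanged.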

> Note: Path 4 documents (no TOC, no headings) see minimal savings. No structural signals exist for heuristics to exploit, so the LLM remains essential there.

## Approach

- **Font-based TOC detection** using PyMuPDF page layout signals (line lengths, dot-leader patterns, digit-ending ratios, TOC keywords)

- **Fuzzy title matching** via `rapidfuzz` with per-path confidence thresholds

- **`LLMCallCounter`** to track actual savings per document

- **`if_use_heuristics` config flag** (default: `yes`); set it to `no` to restore the original behavior exactly.
  Fully backward compatible, no breaking changes.
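
As a rough idea of how savings could be tracked per document, here is a minimal counter sketch. The class name matches the bullet above, but the fields and methods (`record`, `reduction`) are assumptions and may differ from the repository's interface. It also assumes each skipped decision would otherwise have cost exactly one LLM call.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LLMCallCounter:
    """Counts LLM calls made vs. skipped, keyed by decision point.

    Hypothetical interface for illustration only.
    """
    made: Counter = field(default_factory=Counter)
    skipped: Counter = field(default_factory=Counter)

    def record(self, decision_point: str, used_llm: bool) -> None:
        # One entry per decision: either the LLM was called or a
        # heuristic resolved it locally.
        (self.made if used_llm else self.skipped)[decision_point] += 1

    def reduction(self) -> float:
        """Fraction of decisions resolved without the LLM (0.0 if none recorded)."""
        made = sum(self.made.values())
        skipped = sum(self.skipped.values())
        total = made + skipped
        return skipped / total if total else 0.0
```

For example, recording one skipped `verify_toc` decision and three LLM-backed ones gives `reduction() == 0.25`.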

## Implementation

Full implementation and test results:
https://github.com/Unizoy/pageindex-optimized

Diff showing exact changes:
https://github.com/Unizoy/pageindex-optimized/pull/1

The implementation is complete and tested. Early results look promising, and we would love feedback from the team on edge cases we may not have hit yet. Happy to open a PR or discuss a different integration approach if preferred.
