[Optimization] ~42% LLM call reduction via hybrid heuristics (font detection + fuzzy matching) #232

@ShainaHussain

Hi team! We've been using PageIndex on large documents (50–100+ pages) and found that a significant portion of LLM
calls during indexing can be resolved locally without losing accuracy.

Findings

Many LLM calls are for high-confidence decisions that local heuristics can handle:

| Decision point | Current behavior | Proposed optimization |
|---|---|---|
| find_toc_pages | 1 LLM call per page scanned | Font/layout analysis (dot leaders, digit-ending lines, TOC keywords) resolves obvious cases locally |
| verify_toc | 1 LLM call per TOC entry | Fuzzy string matching pre-confirms title presence when confidence is high |
| check_title_appearance_in_start | 1 LLM call per node | Fuzzy match on the first ~300 chars of the target page |
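
To make the find_toc_pages row concrete, here is a minimal sketch of a text-only TOC check built from the three signals named above. It assumes the page text has already been extracted (for example with PyMuPDF); the function name, keyword list, regexes, and thresholds are illustrative guesses, not values from the linked repository.

```python
import re

# Hypothetical signals and thresholds, chosen for illustration only.
TOC_KEYWORDS = ("table of contents", "contents")
DOT_LEADER = re.compile(r"\.{4,}|(?:\. ){3,}")  # "Intro ..... 3" or "Intro . . . 3"
ENDS_WITH_DIGIT = re.compile(r"\d\s*$")         # line ends with a page number


def classify_toc_page(page_text: str) -> str:
    """Return "toc", "not_toc", or "uncertain" for one page of extracted text."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return "not_toc"

    dot_ratio = sum(bool(DOT_LEADER.search(ln)) for ln in lines) / len(lines)
    digit_ratio = sum(bool(ENDS_WITH_DIGIT.search(ln)) for ln in lines) / len(lines)
    # Only look for a TOC keyword near the top of the page.
    has_keyword = any(kw in ln.lower() for ln in lines[:5] for kw in TOC_KEYWORDS)

    # Signals agree strongly: decide locally, no LLM call.
    if has_keyword and (dot_ratio >= 0.3 or digit_ratio >= 0.5):
        return "toc"
    if not has_keyword and dot_ratio == 0 and digit_ratio < 0.1:
        return "not_toc"
    # Mixed signals: the caller should escalate this page to the LLM.
    return "uncertain"
```

Only the "uncertain" result would trigger the per-page LLM call; the other two results are the obvious cases resolved locally.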

Initial test on an 86-page paper: ~260 LLM calls → ~150 (~42% reduction) with identical verification accuracy.
Tested on a small set of documents — broader benchmarking across document types would be a useful next step.

Key constraint: heuristics only skip a call when confidence is high. Uncertain cases always escalate to the LLM. The self-verification loop is never bypassed.
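
The escalation rule can be sketched as a small gate around the LLM call. This version uses difflib from the standard library as a stand-in for rapidfuzz so it runs without extra dependencies; the helper names, the 0.9 threshold, and the ask_llm callback are assumptions for illustration, not the repository's code.

```python
import re
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_window_ratio(title: str, page_text: str, head_chars: int = 300) -> float:
    """Best similarity between the title and any same-length window of the page head."""
    needle = _normalize(title)
    hay = _normalize(page_text[:head_chars])
    if not needle or not hay:
        return 0.0
    if len(hay) <= len(needle):
        return SequenceMatcher(None, needle, hay, autojunk=False).ratio()
    return max(
        SequenceMatcher(None, needle, hay[i:i + len(needle)], autojunk=False).ratio()
        for i in range(len(hay) - len(needle) + 1)
    )


def title_starts_page(title, page_text, ask_llm, threshold=0.9):
    """Confirm locally on a strong match; otherwise escalate to the LLM.

    Returns (answer, source), where source is "heuristic" or "llm".
    """
    if best_window_ratio(title, page_text) >= threshold:
        return True, "heuristic"
    # A weak match is not treated as a "no": the LLM still decides.
    return ask_llm(title, page_text), "llm"
```

Note that the gate only ever short-circuits a confident "yes". A low score never produces a local rejection, which keeps every uncertain case on the LLM path.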

Note: Path 4 documents (no TOC, no headings) see minimal savings. No structural signals exist for heuristics to exploit, so the LLM remains essential there.

Approach

  • Font-based TOC detection using PyMuPDF page layout signals (line lengths, dot-leader patterns, digit-ending ratios, TOC keywords)

  • Fuzzy title matching via rapidfuzz with per-path confidence thresholds

  • LLMCallCounter to track actual savings per document

  • if_use_heuristics config flag (default: yes); set it to no to restore the original behavior exactly.
    Fully backward compatible, no breaking changes.
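
As a rough illustration of how the counter and the flag might fit together, the sketch below wraps one decision point. The class body, parse_flag, and decide are hypothetical; only the names LLMCallCounter and if_use_heuristics and the yes/no values come from the list above.

```python
from dataclasses import dataclass


@dataclass
class LLMCallCounter:
    """Tracks LLM calls made and calls resolved locally (illustrative sketch)."""
    made: int = 0
    avoided: int = 0

    def reduction(self) -> float:
        """Fraction of would-be LLM calls that were avoided."""
        total = self.made + self.avoided
        return self.avoided / total if total else 0.0


def parse_flag(value: str) -> bool:
    """Interpret the yes/no config value; anything other than "yes" disables heuristics."""
    return value.strip().lower() == "yes"


def decide(heuristic, ask_llm, counter, if_use_heuristics="yes"):
    """Try the heuristic first when enabled; a None verdict means "uncertain"."""
    if parse_flag(if_use_heuristics):
        verdict = heuristic()
        if verdict is not None:
            counter.avoided += 1
            return verdict
    # Flag is off, or the heuristic was unsure: fall back to the original LLM path.
    counter.made += 1
    return ask_llm()
```

With the issue's approximate figures, LLMCallCounter(made=150, avoided=110).reduction() evaluates to about 0.42. Setting if_use_heuristics to "no" routes every decision through ask_llm, which mirrors the original behavior.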

Implementation

Full implementation and test results:
https://github.com/Unizoy/pageindex-optimized

Diff showing exact changes:
Unizoy/pageindex-optimized#1

The implementation is complete and tested. Early results look promising, and we would love feedback from the team on
edge cases we may not have hit yet. Happy to open a PR or discuss a different integration approach if preferred.
