Fix double-load issue in detect_pdf_type function by NotAShelf · Pull Request #34 · firecrawl/pdf-inspector

NotAShelf · 2026-04-14T20:25:54Z

Hi, this is a relatively small PR to simplify and streamline how PDF files are loaded and their page counts are determined. Previously Document::load_metadata() would read the file once and then Document::load() would read it again and thus yielding two file reads. Note that Document::load() on a path handles file reading internally. I've made it so that std::fs::read() reads it once and Document::load_mem() parses it from memory. This aligns detect_pdf_type_with_config with load_document_from_path in lib.rs, which already used this pattern.

This should eliminate one redundant parse for the same buffer, which I think is more semantically correct. Though, one potential correctness issue is metadata.page_count (from PDF's /Count key) vs doc.get_pages().len() (via lopdf's parsed page count). If a PDF has a malformed page tree, these could differ. The original code used the PDF's claimed count; my code uses lopdf's successfully-parsed count. I have not verified this specific edge case personally, but my tests yield appropriate and faster results. One test fixture (2013-app2.pdf) has lopdf parse errors and previously reported 9 pages from /Count vs 8 from get_pages().len(). The new behavior seems more correct. Integration tests also continue to pass. In short, the behavioral difference is that my version:

Is faster (on account of having eliminated redundant I/O/parsing)
Has slightly different page count semantics that actually seems more correct

This should reduce code complexity and potential redundancy a little bit. It's also faster! On my system total avg time went down from 8.95ms to 7.56ms, and the 9-page PDF parsing got down from 4.99ms to 4.00ms. I'd say ~15-20% operations are pretty neat :)

Worth nothing that I haven't tested the library in my own code yet; I'm currently on a memory-constrained system, and the project I'm tryin to use pdf-inspector on OOMs :(

Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: I4090d1ae3b9bd018fb447ab6cf66b6ad6a6a6964

abimaelmartell

Thank you so much for this contribution, looks good, have a few comments from code review agent:

Good catch on the double-load. The two changes here are both solid:

1. Eliminating redundant I/O/parsing — detect_pdf_type_with_config was calling load_metadata (reads + parses file) then load (reads + parses file again). Reading once into a buffer and using load_mem is the right fix, and matches how load_document_from_path in lib.rs already works. Same improvement for the _mem variant — no reason to parse the buffer twice just to get a page count.

2. Page count semantics — Using doc.get_pages().len() instead of the metadata /Count is more correct. The /Count key reflects the PDF's claimed page tree size, which can be wrong in malformed documents. get_pages().len() reflects what lopdf actually parsed, which is what the rest of the detection logic already uses internally (total_pages = pages.len() as u32 at line 164). The extraction path in lib.rs already uses this approach too. The 2013-app2.pdf case (9 claimed vs 8 parsed) is a good example — reporting 9 pages when only 8 are parseable is misleading.

One minor note: the file-path version of detect_pdf_type_with_config doesn't apply structure_tree::fix_bare_struct_names() preprocessing that the extraction path does in load_document_from_mem. That's fine for detection since the detector doesn't inspect struct tree roles, but worth noting that the two load paths aren't fully unified. Not a blocker for this PR — just something to be aware of if detection ever starts using struct tree info.

Overall this is a clean simplification. Fewer lines, fewer allocations, more consistent with the extraction path, and more correct page count semantics. 👍

Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: I3495f1934796a2f4fc954fbde2daf2b96a6a6964

NotAShelf · 2026-04-16T09:58:15Z

Thanks for the review @abimaelmartell. I've pushed a quick comment to address the minor note. Hopefully this satisfies the review agent.

I think a full-blown "fix" would be wildly out of scope for this pull request, but I am happy to work on the issue itself in the future if it ever becomes a large concern.

NotAShelf added 2 commits April 14, 2026 23:12

fix: eliminate double-load in detect_pdf_type functions

992ec44

Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: I4090d1ae3b9bd018fb447ab6cf66b6ad6a6a6964

Merge branch 'firecrawl:main' into main

8a0b51c

abimaelmartell reviewed Apr 15, 2026

View reviewed changes

docs: note asymmetry in load paths between detector and extraction

a084ed1

Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: I3495f1934796a2f4fc954fbde2daf2b96a6a6964

Merge branch 'firecrawl:main' into main

ae69d35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix double-load issue in detect_pdf_type function#34

Fix double-load issue in detect_pdf_type function#34
NotAShelf wants to merge 4 commits into
firecrawl:mainfrom
NotAShelf:main

NotAShelf commented Apr 14, 2026

Uh oh!

abimaelmartell left a comment •

edited

Loading

Uh oh!

NotAShelf commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

NotAShelf commented Apr 14, 2026

Uh oh!

abimaelmartell left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NotAShelf commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

abimaelmartell left a comment •

edited

Loading