Skip to content

Fix double-load issue in detect_pdf_type function#34

Open
NotAShelf wants to merge 4 commits into
firecrawl:mainfrom
NotAShelf:main
Open

Fix double-load issue in detect_pdf_type function#34
NotAShelf wants to merge 4 commits into
firecrawl:mainfrom
NotAShelf:main

Conversation

@NotAShelf
Copy link
Copy Markdown

Hi, this is a relatively small PR to simplify and streamline how PDF files are loaded and their page counts are determined. Previously Document::load_metadata() would read the file once and then Document::load() would read it again and thus yielding two file reads. Note that Document::load() on a path handles file reading internally. I've made it so that std::fs::read() reads it once and Document::load_mem() parses it from memory. This aligns detect_pdf_type_with_config with load_document_from_path in lib.rs, which already used this pattern.

This should eliminate one redundant parse for the same buffer, which I think is more semantically correct. Though, one potential correctness issue is metadata.page_count (from PDF's /Count key) vs doc.get_pages().len() (via lopdf's parsed page count). If a PDF has a malformed page tree, these could differ. The original code used the PDF's claimed count; my code uses lopdf's successfully-parsed count. I have not verified this specific edge case personally, but my tests yield appropriate and faster results. One test fixture (2013-app2.pdf) has lopdf parse errors and previously reported 9 pages from /Count vs 8 from get_pages().len(). The new behavior seems more correct. Integration tests also continue to pass. In short, the behavioral difference is that my version:

  1. Is faster (on account of having eliminated redundant I/O/parsing)
  2. Has slightly different page count semantics that actually seems more correct

This should reduce code complexity and potential redundancy a little bit. It's also faster! On my system total avg time went down from 8.95ms to 7.56ms, and the 9-page PDF parsing got down from 4.99ms to 4.00ms. I'd say ~15-20% operations are pretty neat :)


Worth nothing that I haven't tested the library in my own code yet; I'm currently on a memory-constrained system, and the project I'm tryin to use pdf-inspector on OOMs :(

Signed-off-by: NotAShelf <raf@notashelf.dev>
Change-Id: I4090d1ae3b9bd018fb447ab6cf66b6ad6a6a6964
Copy link
Copy Markdown
Member

@abimaelmartell abimaelmartell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this contribution, looks good, have a few comments from code review agent:

Good catch on the double-load. The two changes here are both solid:

1. Eliminating redundant I/O/parsingdetect_pdf_type_with_config was calling load_metadata (reads + parses file) then load (reads + parses file again). Reading once into a buffer and using load_mem is the right fix, and matches how load_document_from_path in lib.rs already works. Same improvement for the _mem variant — no reason to parse the buffer twice just to get a page count.

2. Page count semantics — Using doc.get_pages().len() instead of the metadata /Count is more correct. The /Count key reflects the PDF's claimed page tree size, which can be wrong in malformed documents. get_pages().len() reflects what lopdf actually parsed, which is what the rest of the detection logic already uses internally (total_pages = pages.len() as u32 at line 164). The extraction path in lib.rs already uses this approach too. The 2013-app2.pdf case (9 claimed vs 8 parsed) is a good example — reporting 9 pages when only 8 are parseable is misleading.

One minor note: the file-path version of detect_pdf_type_with_config doesn't apply structure_tree::fix_bare_struct_names() preprocessing that the extraction path does in load_document_from_mem. That's fine for detection since the detector doesn't inspect struct tree roles, but worth noting that the two load paths aren't fully unified. Not a blocker for this PR — just something to be aware of if detection ever starts using struct tree info.

Overall this is a clean simplification. Fewer lines, fewer allocations, more consistent with the extraction path, and more correct page count semantics. 👍

Signed-off-by: NotAShelf <raf@notashelf.dev>
Change-Id: I3495f1934796a2f4fc954fbde2daf2b96a6a6964
@NotAShelf
Copy link
Copy Markdown
Author

Thanks for the review @abimaelmartell. I've pushed a quick comment to address the minor note. Hopefully this satisfies the review agent.

I think a full-blown "fix" would be wildly out of scope for this pull request, but I am happy to work on the issue itself in the future if it ever becomes a large concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants