Fix double-load issue in detect_pdf_type function#34
Conversation
Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: I4090d1ae3b9bd018fb447ab6cf66b6ad6a6a6964
There was a problem hiding this comment.
Thank you so much for this contribution, looks good, have a few comments from code review agent:
Good catch on the double-load. The two changes here are both solid:
1. Eliminating redundant I/O/parsing — detect_pdf_type_with_config was calling load_metadata (reads + parses file) then load (reads + parses file again). Reading once into a buffer and using load_mem is the right fix, and matches how load_document_from_path in lib.rs already works. Same improvement for the _mem variant — no reason to parse the buffer twice just to get a page count.
2. Page count semantics — Using doc.get_pages().len() instead of the metadata /Count is more correct. The /Count key reflects the PDF's claimed page tree size, which can be wrong in malformed documents. get_pages().len() reflects what lopdf actually parsed, which is what the rest of the detection logic already uses internally (total_pages = pages.len() as u32 at line 164). The extraction path in lib.rs already uses this approach too. The 2013-app2.pdf case (9 claimed vs 8 parsed) is a good example — reporting 9 pages when only 8 are parseable is misleading.
One minor note: the file-path version of detect_pdf_type_with_config doesn't apply structure_tree::fix_bare_struct_names() preprocessing that the extraction path does in load_document_from_mem. That's fine for detection since the detector doesn't inspect struct tree roles, but worth noting that the two load paths aren't fully unified. Not a blocker for this PR — just something to be aware of if detection ever starts using struct tree info.
Overall this is a clean simplification. Fewer lines, fewer allocations, more consistent with the extraction path, and more correct page count semantics. 👍
Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: I3495f1934796a2f4fc954fbde2daf2b96a6a6964
|
Thanks for the review @abimaelmartell. I've pushed a quick comment to address the minor note. Hopefully this satisfies the review agent. I think a full-blown "fix" would be wildly out of scope for this pull request, but I am happy to work on the issue itself in the future if it ever becomes a large concern. |
Hi, this is a relatively small PR to simplify and streamline how PDF files are loaded and their page counts are determined. Previously
Document::load_metadata()would read the file once and thenDocument::load()would read it again and thus yielding two file reads. Note thatDocument::load()on a path handles file reading internally. I've made it so thatstd::fs::read()reads it once andDocument::load_mem()parses it from memory. This alignsdetect_pdf_type_with_configwithload_document_from_pathinlib.rs, which already used this pattern.This should eliminate one redundant parse for the same buffer, which I think is more semantically correct. Though, one potential correctness issue is
metadata.page_count(from PDF's/Countkey) vsdoc.get_pages().len()(via lopdf's parsed page count). If a PDF has a malformed page tree, these could differ. The original code used the PDF's claimed count; my code uses lopdf's successfully-parsed count. I have not verified this specific edge case personally, but my tests yield appropriate and faster results. One test fixture (2013-app2.pdf) has lopdf parse errors and previously reported 9 pages from/Countvs 8 fromget_pages().len(). The new behavior seems more correct. Integration tests also continue to pass. In short, the behavioral difference is that my version:This should reduce code complexity and potential redundancy a little bit. It's also faster! On my system total avg time went down from 8.95ms to 7.56ms, and the 9-page PDF parsing got down from 4.99ms to 4.00ms. I'd say ~15-20% operations are pretty neat :)
Worth nothing that I haven't tested the library in my own code yet; I'm currently on a memory-constrained system, and the project I'm tryin to use pdf-inspector on OOMs :(