Add LiteParse local PDF ingestion engine (#2074) #2075
Add LiteParseParser, a fully-local PDF parser built on LiteParse (run-llama/liteparse, PDFium + optional Tesseract OCR), as a third PDF engine alongside Docling and LlamaParse.

LiteParse exposes line-level spatial text items (text + absolute PDF-point bounding boxes, top-left origin, with font metadata) but no element types or hierarchy. To match the other engines' core outputs the parser:

- maps each LiteParse line bbox to word-level PAWLs tokens extracted with pdfplumber (same path as LlamaParse), via a shapely spatial index;
- derives feature labels (Title / Section Header / Text Block) from font size (modal size = body; larger sizes ranked into heading levels);
- builds a parent-child hierarchy with a heading stack walked in reading order, setting parent_id (fed to import_annotations + subtree groups);
- extracts embedded images into the unified token array and emits Image annotations.

Select with PDF_PARSER=liteparse; DOCX/PPTX fall back to Docling since LiteParse is PDF-only. liteparse is imported lazily so registry discovery and startup do not require it. Configuration via LITEPARSE_* settings; optional dependency in requirements/ingestors/liteparse.txt.

Adds tests (opencontractserver/tests/test_doc_parser_liteparse.py) and docs (docs/pipelines/liteparse_parser.md).

Closes #2074
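The heading-stack step in the description above can be sketched roughly as follows. This is an illustration only, not the parser's code: the `Annotation` dataclass, the `level` field, and `assign_parents` are hypothetical names, and heading levels are assumed to be already derived from font size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """Hypothetical stand-in for a parsed line; not the real OpenContracts type."""

    id: int
    label: str
    level: Optional[int]  # 1 = Title, 2+ = Section Header, None = body text
    parent_id: Optional[int] = None


def assign_parents(annotations: list[Annotation]) -> list[Annotation]:
    """Walk annotations in reading order and set parent_id from a heading stack."""
    stack: list[Annotation] = []  # currently open headings, outermost first
    for ann in annotations:
        if ann.level is None:
            # Body text hangs off the innermost open heading, if there is one.
            ann.parent_id = stack[-1].id if stack else None
            continue
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1].level >= ann.level:
            stack.pop()
        ann.parent_id = stack[-1].id if stack else None
        stack.append(ann)
    return annotations
```

With a Title followed by two sections, each section's body text ends up parented to its own section header, and both headers are parented to the Title.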
Code Review

This PR adds LiteParseParser, a fully-local PDF ingestion engine backed by LiteParse (PDFium). The implementation is well-structured: it follows the LlamaParseParser patterns correctly, all cross-file function signatures match, and the test suite is thorough. Four confirmed bugs and several cleanup items follow.

Confirmed Bugs

1.
Fix: extract both from

2. Four
Fix: add the four missing

3. When
Fix: either pass a page filter to

4. Degenerate-bbox guard is a no-op when an item falls exactly on the page edge
After clamping, if a text item's reported x equals
Fix: expand in the other direction when at the edge, e.g.

Code Quality

5. Every setting is defined in

6. CLAUDE.md §4 ("No magic numbers") requires hardcoded values to live in

7. The parser accesses 10+ attribute names (

Overall this is solid work. Fixing bugs 1 and 2 is the highest priority before merge; 3 and 4 matter if
…e env vars

Address code-review findings on PR #2075:

- detect_headings / heading_size_ratio call-time kwargs were resolved in the docstring but dropped: _classify_heading_sizes read self.* directly. Thread both from _parse_document_impl through _convert_result_to_opencontracts into _classify_heading_sizes (falling back to instance settings when None) so the advertised per-call overrides actually take effect. Adds a regression test.
- Define the four LITEPARSE_IMAGE_* env vars (FORMAT/QUALITY/MIN_IMAGE_WIDTH/MIN_IMAGE_HEIGHT) in base.py so the Settings env_var declarations are honored by migrate_pipeline_settings instead of silently using dataclass defaults.
- Export page_count now uses len(pawls_file_content) (the value save_parsed_data persists) rather than the count of LiteParse-parsed pages, so the two agree when target_pages restricts parsing to a subset.
- _bounds_from_item: a text item pinned exactly to the page edge produced a zero-size box (the +1 guard clamped back to the edge); expand inward in that case so the bbox is always >=1pt. Adds a unit test.
- Move the fallback US-Letter page dimensions to constants/document_processing.py (DEFAULT_PDF_PAGE_WIDTH/HEIGHT) and import them, removing the module-level magic numbers (CLAUDE.md "no magic numbers").
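The page-edge case in the _bounds_from_item bullet can be illustrated with a small sketch. The function names, the tuple return shape, and the `min_size` parameter are assumptions made for illustration; the real helper's signature is not shown in this conversation.

```python
def _ensure_min_extent(lo, hi, limit, min_size):
    """Return (lo, hi) spanning at least min_size without exceeding limit."""
    if hi - lo >= min_size:
        return lo, hi
    if lo + min_size <= limit:
        return lo, lo + min_size  # room to grow outward (right / down)
    # Pinned to the far edge: grow inward instead of clamping back to zero size.
    return max(limit - min_size, 0.0), limit


def clamp_bounds(x, y, width, height, page_width, page_height, min_size=1.0):
    """Clamp a top-left-origin box to the page; returns (left, top, right, bottom)."""
    left = min(max(x, 0.0), page_width)
    right = min(max(x + width, 0.0), page_width)
    top = min(max(y, 0.0), page_height)
    bottom = min(max(y + height, 0.0), page_height)
    # Normalise inverted boxes before enforcing the minimum size.
    left, right = min(left, right), max(left, right)
    top, bottom = min(top, bottom), max(top, bottom)
    left, right = _ensure_min_extent(left, right, page_width, min_size)
    top, bottom = _ensure_min_extent(top, bottom, page_height, min_size)
    return left, top, right, bottom
```

A zero-width item at x = 612 on a 612 pt wide page comes back as the box from 611 to 612 rather than a zero-size box at the edge.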
Add tests exercising the parser's previously-uncovered fallback paths so the codecov patch-coverage gate passes:

- token-extraction failure -> synthesized empty PAWLs pages
- image-extraction failure handled gracefully
- optional construct kwargs (ocr_server_url / target_pages / max_pages / password) forwarded to LiteParse only when set
- LiteParse.parse() exception -> returns None
- invalid page_num / width / height -> default dimensions
- full image metadata copied; out-of-range image page skipped
- non-numeric font_size / coordinates and inverted bbox in the helpers
Thanks for the review — addressed in 29c0254 (fixes) and 18aea12 (coverage). Disposition of each item: Fixed
Not changed (intentional)
Also added tests covering the token-/image-extraction failure fallbacks and invalid-page-metadata paths to bring patch coverage back over the gate. Generated by Claude Code |
Review: Add LiteParse local PDF ingestion engine (#2074)

This PR adds

Bug:
…prefix

Address the second code-review pass on PR #2075:

- Clamp image_format to {jpeg,png} at use-time. cast() is a type-checker hint only; an operator value like "webp" would otherwise be saved as PNG bytes but tagged/extensioned "webp", producing broken data: URLs. Adds a test.
- Expand the LiteParseParser PARSER_KWARGS entry to every setting _parse_document_impl actually resolves per-call (dpi, num_workers, target_pages, max_pages, password, ocr_language, ocr_server_url, image_mode, heading_size_ratio). image_* stay out because they are read from resolved settings in _append_image_tokens, not per-call (mirrors LlamaParseParser).
- Add DOCUMENT_IMAGE_STORAGE_PREFIX to constants/document_processing.py and use it in the liteparse, llamaparse, and docling parsers (identical output), removing the triplicated "documents/" literal.
- _classify_item now takes a size->level dict for an O(1) lookup instead of two O(n) list scans per text item; the dict is built once per document.
- _classify_heading_sizes counts font sizes into a Counter incrementally rather than materialising an O(n_items) list first.

Declined (out of scope for this PR): extracting _create_annotation and the fallback-page / image-token blocks into a shared base helper would refactor the well-tested LlamaParseParser path; tracked as a follow-up.
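A minimal sketch of the use-time clamp on image_format, assuming a plain string setting. `normalize_image_format` and both constants are hypothetical names, and the fallback value is an assumption: the conversation does not say which format the parser falls back to.

```python
ALLOWED_IMAGE_FORMATS = ("jpeg", "png")  # the two formats named in the commit
FALLBACK_IMAGE_FORMAT = "jpeg"  # assumption: the real fallback is not stated


def normalize_image_format(value):
    """Return a supported format name, whatever the operator configured."""
    candidate = str(value or "").strip().lower()
    if candidate in ALLOWED_IMAGE_FORMATS:
        return candidate
    # typing.cast() would accept "webp" silently at runtime; clamp explicitly so
    # the encoded bytes, the file extension, and the data: URL all agree.
    return FALLBACK_IMAGE_FORMAT
```

The point of doing this where the value is used, rather than where it is declared, is that the encoder and the label read the same normalized value.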
Thanks for the second pass — addressed in d3304eb. Fixed
Declined (out of scope)
All changes are Generated by Claude Code |
Code Review

This PR adds a solid, well-structured

Bug: fallback
Code Review

This PR adds

Three issues worth addressing before merge.

1.
Address the third/fourth review passes on PR #2075:

- Fallback PAWLs page synthesis (used when pdfplumber token extraction raises) built a *compacted* list ordered by sorted(page_dimensions), breaking the "list position == absolute page index" invariant the success path guarantees. With target_pages restricting parsing to a non-zero-starting / non-contiguous range, _append_image_tokens (which indexes by absolute page number) then wrote images into the wrong page entries or dropped them. Rebuild the fallback over range(max_idx + 1), filling gaps, so positions always align. Adds a regression test (single parsed page at absolute index 1).
- llamaparse_parser.py now imports DEFAULT_PDF_PAGE_WIDTH/HEIGHT instead of its own local 612/792 literals, completing the centralization this PR introduced.
- Document why LiteParse's render DPI (self.dpi) is intentionally not forwarded to extract_images_from_pdf (it uses the pipeline-wide extraction DPI default, matching Docling/LlamaParse).
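The "list position == absolute page index" invariant from the first bullet can be sketched like this. The page dict is a simplified PAWLs-like layout assumed for illustration, `synthesize_fallback_pages` is a hypothetical name, and the constants mirror the 612/792 US-Letter defaults mentioned in this conversation.

```python
DEFAULT_PDF_PAGE_WIDTH = 612.0   # US-Letter width in points
DEFAULT_PDF_PAGE_HEIGHT = 792.0  # US-Letter height in points


def synthesize_fallback_pages(page_dimensions):
    """Build empty pages so that pages[i] always describes absolute page i.

    page_dimensions maps a 0-based absolute page index to (width, height) and
    may be sparse, for example when only a subset of pages was parsed.
    """
    if not page_dimensions:
        return []
    max_idx = max(page_dimensions)
    pages = []
    for idx in range(max_idx + 1):
        # Gaps get default dimensions instead of being compacted away.
        width, height = page_dimensions.get(
            idx, (DEFAULT_PDF_PAGE_WIDTH, DEFAULT_PDF_PAGE_HEIGHT)
        )
        pages.append(
            {"page": {"width": width, "height": height, "index": idx}, "tokens": []}
        )
    return pages
```

With a single parsed page at absolute index 1, the result has two entries, so code that indexes by absolute page number still lands on the right page.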
Thanks — addressing the last two review passes together (they overlap) in cb3c0bb. Fixed
Clarified (no behavior change)
Declined (cross-cutting; better as a dedicated follow-up)
All changes are Generated by Claude Code |
The mypy pre-commit hook flagged four calls that deliberately pass malformed types (None / "bad" / "big") into the typed make_page/make_item test helpers to exercise the parser's defensive coercion paths. Mark those four calls with # type: ignore[arg-type]; the helpers stay strictly typed for the normal calls.
Code Review

This PR adds

1. PDF password exposed in logs:
Code Review: LiteParseParser (#2074)

This PR adds a clean third local PDF ingestion engine built on LiteParse. The overall structure (lazy import, settings dataclass, pdfplumber token mapping, shapely spatial index for bbox→token lookup, heading stack for parent-child hierarchy) follows the LlamaParse parser's patterns well. The test suite is thorough. A few issues below, ordered by severity.

1. Heading detection silently fails for heading-heavy documents [HIGH]

File:
The code comment acknowledges the tie case ("on short documents a heading size can tie body for frequency") and applies a

Common real-world triggers: slide decks, tables of contents, legal exhibits with repeated section labels. Suggest capping body-size detection at the smallest highly-frequent size rather than the absolute modal, or using a frequency-weighted approach.

2. Orphaned image tokens on mid-loop exception in
…hardening

Address the fifth/sixth review passes on PR #2075:

- SECURITY: BaseParser.parse_document logged the merged kwargs unredacted at INFO before the per-parser impl could redact, leaking SECRET settings (the new LiteParse PDF password, and pre-existing LlamaParse api_key) into logs. Redact via redact_sensitive_kwargs in the base log line, which fixes every parser.
- Heading detection now weights font sizes by CHARACTER MASS, not line frequency. Body prose dominates a document's characters, so this correctly identifies body text even when heading-style lines outnumber it (slide decks, TOCs, repeated exhibit labels) or when small footnotes are as frequent as body: cases the previous most-frequent + min() heuristic misclassified. Updated the hierarchy test to realistic line lengths and added a unit test for the heading-heavy/footnote scenario.
- _append_image_tokens skips image dicts missing required keys, so a malformed entry can't raise mid-loop and strand partial image tokens the except block can't roll back (the append mutates pawls_pages in place).
- Office-format fallback now keys off a named _PDF_ONLY_PARSER_NAMES set instead of an inline "liteparse" string, so adding another PDF-only engine stays a one-line change rather than a silent DOCX-routing bug.
- Documented that, like the word-token pass, image extraction spans the whole PDF; with target_pages the full token layer is present but only parsed pages are annotated (harmless: per-page token indices).
- Removed a dead else branch in the import-error test cleanup.
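The character-mass weighting described in the second bullet might look roughly like the sketch below. It is not the parser's _classify_heading_sizes: the function name, the `(text, font_size)` input shape, and the default `heading_size_ratio` of 1.15 are all assumptions for illustration.

```python
from collections import Counter


def classify_heading_sizes(items, heading_size_ratio=1.15):
    """Pick the body font size by character mass and rank larger sizes.

    items is an iterable of (text, font_size) pairs, one per line.
    Returns (body_size, {font_size: heading_level}) with level 1 the largest.
    """
    weights = Counter()
    for text, font_size in items:
        stripped = text.strip()
        if not stripped:
            continue  # blank lines carry no character mass
        # Weight by characters, not by line count: a few long paragraphs
        # outweigh many short heading-style lines.
        weights[round(float(font_size), 1)] += len(stripped)
    if not weights:
        return None, {}
    body_size = max(weights, key=weights.get)
    threshold = body_size * heading_size_ratio
    heading_sizes = sorted((s for s in weights if s >= threshold), reverse=True)
    levels = {size: level for level, size in enumerate(heading_sizes, start=1)}
    return body_size, levels
```

In a slide-deck-like input with five short 18 pt lines and two long 11 pt paragraphs, line frequency would pick 18 pt as body, while character mass picks 11 pt and ranks 18 pt as heading level 1.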
Thanks — addressing the last two review rounds together (they overlap) in e9c8eee. Fixed
Clarified, no behavior change
Declined (dedicated follow-up)
All changes Generated by Claude Code |
Code Review

This PR adds a well-structured

Bug: Image token
…e constants

Address the seventh review pass on PR #2075:

- BUG (regression from the prior commit's malformed-image guard): _append_image_tokens skipped malformed images when appending tokens but still stored the full raw list in images_by_page, so the annotation loop's token_offset + img_idx pointed past the real slot for every image after a skipped one. Store only the successfully-appended images (in append order) so annotation token indices stay aligned. Adds a regression test (malformed first image must not shift the second image's tokenIndex).
- Exclude sub-point font sizes (< MIN_CONTENT_FONT_SIZE = 1.0) from heading-size detection so a tiny vector watermark carrying many characters can't be picked as body text and drag the heading threshold toward zero (turning every line into a heading). Adds a unit test.
- docling_parser_rest.py now imports DEFAULT_PDF_PAGE_WIDTH/HEIGHT instead of its three inline 612/792 literals, completing the centralization the constant's docstring described; docstring updated to match.

Declined (tracked as a dedicated follow-up): extracting _build_image_token / _create_annotation to a shared base/util and the shared test helpers; those refactor the well-tested LlamaParse/Docling paths and belong in their own PR.
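The index-alignment bug in the first bullet is easier to see in a small sketch. The token and image dict shapes, `REQUIRED_IMAGE_KEYS`, and both function names are hypothetical; only the idea of keeping the appended images (not the raw list) comes from the commit message.

```python
REQUIRED_IMAGE_KEYS = ("x", "y", "width", "height")  # hypothetical required fields


def append_image_tokens(pawls_pages, raw_images_by_page):
    """Append image tokens in place; return {page_idx: (token_offset, kept)}."""
    appended = {}
    for page_idx, images in raw_images_by_page.items():
        if not 0 <= page_idx < len(pawls_pages):
            continue  # image on a page outside the token layer
        tokens = pawls_pages[page_idx]["tokens"]
        token_offset = len(tokens)  # slot of the first image token on this page
        kept = []
        for image in images:
            if any(key not in image for key in REQUIRED_IMAGE_KEYS):
                continue  # malformed: no token appended, so do not keep it either
            token = {key: image[key] for key in REQUIRED_IMAGE_KEYS}
            token.update({"text": "", "is_image": True})
            tokens.append(token)
            kept.append(image)
        # Keeping only the appended images means offset + position == real slot.
        appended[page_idx] = (token_offset, kept)
    return appended


def image_token_indices(appended):
    """Token index of every kept image, per page."""
    return {
        page_idx: [offset + i for i in range(len(kept))]
        for page_idx, (offset, kept) in appended.items()
    }
```

If the raw list were stored instead, a malformed first image would shift every later image's index by one, pointing past the token that was actually appended.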
Good catches — addressed in 63dc0cc. Fixed
Declined (dedicated follow-up)
All changes Generated by Claude Code |
Code Review: LiteParseParser (#2074)

This PR adds a solid third local PDF ingestion engine. The coordinate-system rationale is well-documented, the lazy import pattern is correct, and

1. Conversion failures blamed on LiteParse in logs (CONFIRMED)

Fix: move the

2. Double INFO log per parse invocation (CONFIRMED)

This PR adds
Fix: remove the

3.
…dup log

Address the eighth review pass on PR #2075:

- _convert_result_to_opencontracts now runs in its own try/except, separate from the LiteParse import/parse step, so a failure in the pdfplumber token pass, shapely spatial query, or annotation assembly is logged as a conversion error rather than mislabelled "LiteParse parsing failed" (very different triage). Still returns None on failure. Adds a test for the conversion-failure path.
- Removed the per-impl INFO log (and its now-unused redact import): the base BaseParser.parse_document already logs the redacted merged kwargs, so the impl log was a redundant second line per parse.
- Commented the image-page guard to note an out-of-range page is normally a page past the target_pages range, or (rarely) a page pdfplumber omitted.

Declined: a test enforcing _PDF_ONLY_PARSER_NAMES vs supported_file_types (those settings vars aren't exposed via django.conf.settings, and the in-code comment already flags the sync requirement) and converting test-fixture 612/792 to constants (they're concrete fixture dimensions, not references to the prod default).
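The split error handling from the first bullet can be sketched as two independent try blocks. `parse_document`, `run_parser`, and `convert_result` are hypothetical stand-ins passed in as callables, and the second log message is illustrative wording rather than the message the parser emits.

```python
import logging

logger = logging.getLogger(__name__)


def parse_document(pdf_bytes, run_parser, convert_result):
    """Run parsing and conversion as separate failure domains.

    run_parser stands in for the LiteParse call and convert_result for the
    conversion step; both are injected here purely for illustration.
    """
    try:
        result = run_parser(pdf_bytes)
    except Exception:
        logger.exception("LiteParse parsing failed")
        return None

    try:
        return convert_result(result)
    except Exception:
        # Token extraction, spatial lookup, or annotation assembly went wrong;
        # the upstream parser itself succeeded.
        logger.exception("Converting LiteParse output failed")
        return None
```

Either failure still returns None, so callers behave the same; only the log line changes, which is what makes triage easier.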
Addressed in 57d1725:
Declined (rationale in the commit): a test enforcing Generated by Claude Code |
Code Review

This PR adds a solid third PDF ingestion engine, LiteParseParser, with good test coverage, lazy liteparse import for registry safety, and well-documented heuristics for heading detection and token mapping. The settings schema, fallback page-synthesis logic, and image token indexing are thoughtfully engineered. A few actionable issues below, ranked by severity.

1. Image annotation bounds are not clamped:
Address the ninth review pass on PR #2075:

- Image annotations now route their bounds through _bounds_from_item, so a pdfplumber image whose stream bbox bleeds past the page edge is clamped to the page (and gets the >=1pt guarantee) exactly like text-line annotations, instead of storing out-of-page coordinates. Also removes the duplicated bounds-building.
- Corrected the liteparse version pin. The placeholder >=0.1.0 was wrong: the parser is written against liteparse's 2.x Python API (LiteParse / ParseResult.pages / ParsedPage.text_items / TextItem fields), and the latest published release is 2.2.1 (the project shipped breaking 0.x -> 1.x -> 2.x bumps). Pin to >=2.2.1,<3.0.0 so a >=0.1.0 floor can't resolve to an incompatible 0.x/1.x and an --upgrade can't pull a breaking 3.x. (The review's suggested <0.2.0 cap was based on stale version info and would be uninstallable.)
Addressed in de867a5:
Declined (unchanged rationale — these refactor the well-tested LlamaParse/Docling paths or are parser-specific): backporting the malformed-image guard to LlamaParse, extracting Generated by Claude Code |
Code Review: LiteParse PDF Ingestion Engine (#2074)

This PR adds

Bug: Blank text items skew body-size detection, can silently disable all heading hierarchy

File:

Concrete failure: a PDF with 500 blank items at 14 pt (e.g. list bullets with no text, form fields, or repeated section-label placeholders) accumulates 500 weight at 14 pt; actual 12 pt body prose totals 300 chars. Fix: skip items whose stripped text is empty inside

    text = _attr(item, "text", "") or ""
    if not text.strip():
        continue  # add this
    weights[round(fs_f, 1)] += len(text.strip())  # no longer need max(…,1)

Bug:
Address the tenth review pass on PR #2075:

- _classify_heading_sizes weighted blank/whitespace-only items at 1 char each (the previous max(len,1) floor), but the annotation loop skips those same items. The two passes counted different populations, so a document with many empty lines at a non-body size could out-weigh real body prose and silently zero out heading detection. Skip blank items here too (and drop the floor) so both passes see the same population. Adds a regression test (200 blank 14pt items must not override 12pt body).

Other items in the pass were declined (repeat DRY/style: shared label constants, DEFAULT_WIDTH aliases, MIN_CONTENT_FONT_SIZE location, shared test helpers; plus a speculative air-gap startup warning) or already handled (the malformed-image guard added earlier makes _append_image_tokens' loop exception-free, so the partial-mutation rollback concern can't occur).
Code Review

This PR adds

1. Bug:
File:
On footnote-heavy documents, footnote text (e.g. 9 pt) and body prose (e.g. 11 pt) can accumulate the same total character mass. In that tie,
The fix is

2. Convention: Magic numbers in the test file (CLAUDE.md: "no magic numbers")
File:

3. Cleanup: Redundant module-level aliases in
File:
These are local re-bindings of already-imported constants. Every other parser uses the imported names directly. The comment says "so the literal lives in exactly one place" but it already does, in

4. Minor:
File:
Unlike

5. Minor security note:
File:
The

Summary: The
…ilure)

Address the eleventh review pass on PR #2075:

- On an exact character-mass tie between two font sizes, _classify_heading_sizes now picks the LARGER tied size as body (max instead of min). This raises the heading threshold so the failure mode is under-detection (flat output, same as headings-off) rather than over-detection (real body paragraphs mislabelled as Section Headers), the latter being the worse, more visible error on footnote-heavy documents. The tiebreak only affects exact ties, which occur on degenerate/short inputs; normal documents have a unique body maximum and are unaffected. Renamed/updated the tie test accordingly. (I'd previously defended the min() choice for the short-doc title-ties-body case, but max() is the better general default: it fails toward benign flat output.)

Other items in the pass declined per the established rationale (test-fixture 612/792 literals, DEFAULT_WIDTH aliases, STRtree TYPE_CHECKING micro-opt with no real footprint benefit since shapely loads regardless, password PARSER_KWARGS comment; redaction already covers the logging risk).
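The tie-break rule can be shown in isolation. This is a sketch under the assumption that character mass per font size is already available as a plain dict; `pick_body_size` is a hypothetical name, not a function from the PR.

```python
def pick_body_size(weights):
    """Return the body font size from a {font_size: character_mass} mapping.

    On an exact tie for the largest mass, the larger font size wins, which
    raises the heading threshold and so errs toward flat output.
    """
    if not weights:
        return None
    top_mass = max(weights.values())
    return max(size for size, mass in weights.items() if mass == top_mass)
```

For a footnote-heavy input where 9 pt and 11 pt both total 300 characters, this returns 11 pt, so 11 pt paragraphs are not promoted to Section Headers; when one size has strictly more mass, the tie-break never applies.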
On #1 — you're right that the tiebreak should favor the larger size; switched The other items I'm leaving as-is, with the same rationale given earlier: test-fixture Generated by Claude Code |
Review: Add LiteParse local PDF ingestion engine (#2074)

This PR adds a well-structured third PDF backend (

1.
No code change for this pass — evaluated all six, none are actionable:
Thanks for the confirmations on the non-issues (heading-stack ordering, redaction coverage, bounds clamping, float-vs-int). The PR is functionally complete; remaining suggestions are all cross-parser refactors better done in their own change. Generated by Claude Code |
Closes #2074
What

Adds LiteParseParser, a third, fully-local PDF ingestion engine built on LiteParse (run-llama, PDFium-based with optional Tesseract OCR), alongside the existing Docling and LlamaParse engines. It auto-registers via the pipeline registry and is selected with PDF_PARSER=liteparse.

Following liteparse's actual API

Verified against liteparse's Python source (packages/python/liteparse/types.py + parser.py):

- from liteparse import LiteParse; LiteParse(**opts).parse(bytes|path) -> ParseResult
- ParseResult.pages → ParsedPage (page_num 1-indexed, width, height, text, text_items); ParseResult.text
- TextItem(text, x, y, width, height, font_name, font_size, confidence): absolute PDF points, top-left origin (same convention as pdfplumber), line-level, with no element-type labels or hierarchy in the structured output.

Matching the core outputs of the other engines

Because liteparse's JSON gives line bboxes + text only, the parser augments it to produce the same surface (bboxes, tokens, feature labels, parent-child) as Docling/LlamaParse:

- TextItem coords (clamped; no fractional conversion needed)
- extract_pawls_tokens_from_pdf, same path as LlamaParse); each line bbox mapped to enclosed word tokens with a shapely spatial index
- font_size: modal size = body (Text Block); larger sizes ranked into heading levels → Title / Section Header
- parent_id, fed to import_annotations (self-FK) and build_subtree_groups_for_document (OC_SUBTREE_GROUP relationships)
- extract_images_from_pdf, added to the unified token array + emitted as Image annotations

Heading detection is heuristic and can be disabled (detect_headings=False), in which case output is flat Text Blocks with no parent (matching LlamaParse's flat output).

Changes

- opencontractserver/pipeline/parsers/liteparse_parser.py: the new parser (lazy liteparse import so registry discovery/startup don't require it).
- config/settings/base.py: LITEPARSE_* settings, _PDF_PARSER_MAP["liteparse"], PARSER_KWARGS entry, and a _SELECTED_OFFICE_PARSER fallback so DOCX/PPTX route to Docling when LiteParse (PDF-only) is selected. Default/test config is unchanged (the fallback only diverges when PDF_PARSER=liteparse).
- requirements/ingestors/liteparse.txt: optional dependency.
- opencontractserver/tests/test_doc_parser_liteparse.py: covers token mapping, font-size feature labels, the parent-child hierarchy chain, headings-disabled, image extraction, error paths, configuration, and registry discovery.
- docs/pipelines/liteparse_parser.md + docs/pipelines/pipeline_overview.md: documentation.
- changelog.d/2074-liteparse.added.md: changelog fragment.

Testing

black, isort, flake8, and the changelog-fragment checker pass on all changed files. The Docker/Postgres-backed Django suite could not be run in this environment (no Docker/Django available); the new tests mock liteparse.LiteParse and the pdfplumber-backed extraction utilities, and follow the existing test_doc_parser_llamaparse.py patterns.

🤖 Generated with Claude Code
Generated by Claude Code