Skip to content

Add LiteParse local PDF ingestion engine (#2074)#2075

Open
JSv4 wants to merge 12 commits into
mainfrom
claude/liteparse-doc-ingestion-engine-wli0ff
Open

Add LiteParse local PDF ingestion engine (#2074)#2075
JSv4 wants to merge 12 commits into
mainfrom
claude/liteparse-doc-ingestion-engine-wli0ff

Conversation

@JSv4

@JSv4 JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator

Closes #2074

What

Adds LiteParseParser — a third, fully-local PDF ingestion engine built on LiteParse (run-llama, PDFium-based with optional Tesseract OCR) — alongside the existing Docling and LlamaParse engines. It auto-registers via the pipeline registry and is selected with PDF_PARSER=liteparse.

Following liteparse's actual API

Verified against liteparse's Python source (packages/python/liteparse/types.py + parser.py):

  • from liteparse import LiteParse; LiteParse(**opts).parse(bytes|path) -> ParseResult
  • ParseResult.pagesParsedPage(page_num 1-indexed, width, height, text, text_items); ParseResult.text
  • TextItem(text, x, y, width, height, font_name, font_size, confidence)absolute PDF points, top-left origin (same convention as pdfplumber), line-level, with no element-type labels or hierarchy in the structured output.

Matching the core outputs of the other engines

Because liteparse's JSON gives line bboxes + text only, the parser augments it to produce the same surface (bboxes, tokens, feature labels, parent-child) as Docling/LlamaParse:

Output How
Bounding boxes LiteParse TextItem coords (clamped; no fractional conversion needed)
Tokens Word-level PAWLs tokens via pdfplumber (extract_pawls_tokens_from_pdf, same path as LlamaParse); each line bbox mapped to enclosed word tokens with a shapely spatial index
Feature labels Derived from font_size: modal size = body (Text Block); larger sizes ranked into heading levels → Title / Section Header
Parent-child A heading stack walked in reading order sets parent_id, fed to import_annotations (self-FK) and build_subtree_groups_for_document (OC_SUBTREE_GROUP relationships)
Images Embedded images via extract_images_from_pdf, added to the unified token array + emitted as Image annotations

Heading detection is heuristic and can be disabled (detect_headings=False), in which case output is flat Text Blocks with no parent (matching LlamaParse's flat output).

Changes

  • opencontractserver/pipeline/parsers/liteparse_parser.py — the new parser (lazy liteparse import so registry discovery/startup don't require it).
  • config/settings/base.pyLITEPARSE_* settings, _PDF_PARSER_MAP["liteparse"], PARSER_KWARGS entry, and a _SELECTED_OFFICE_PARSER fallback so DOCX/PPTX route to Docling when LiteParse (PDF-only) is selected. Default/test config is unchanged (the fallback only diverges when PDF_PARSER=liteparse).
  • requirements/ingestors/liteparse.txt — optional dependency.
  • opencontractserver/tests/test_doc_parser_liteparse.py — covers token mapping, font-size feature labels, the parent-child hierarchy chain, headings-disabled, image extraction, error paths, configuration, and registry discovery.
  • docs/pipelines/liteparse_parser.md + docs/pipelines/pipeline_overview.md — documentation.
  • changelog.d/2074-liteparse.added.md — changelog fragment.

Testing

black, isort, flake8, and the changelog-fragment checker pass on all changed files. The Docker/Postgres-backed Django suite could not be run in this environment (no Docker/Django available); the new tests mock liteparse.LiteParse and the pdfplumber-backed extraction utilities, and follow the existing test_doc_parser_llamaparse.py patterns.

🤖 Generated with Claude Code


Generated by Claude Code

Add LiteParseParser, a fully-local PDF parser built on LiteParse
(run-llama/liteparse, PDFium + optional Tesseract OCR), as a third PDF
engine alongside Docling and LlamaParse.

LiteParse exposes line-level spatial text items (text + absolute PDF-point
bounding boxes, top-left origin, with font metadata) but no element types
or hierarchy. To match the other engines' core outputs the parser:

- maps each LiteParse line bbox to word-level PAWLs tokens extracted with
  pdfplumber (same path as LlamaParse), via a shapely spatial index;
- derives feature labels (Title / Section Header / Text Block) from font
  size (modal size = body; larger sizes ranked into heading levels);
- builds a parent-child hierarchy with a heading stack walked in reading
  order, setting parent_id (fed to import_annotations + subtree groups);
- extracts embedded images into the unified token array and emits Image
  annotations.

Select with PDF_PARSER=liteparse; DOCX/PPTX fall back to Docling since
LiteParse is PDF-only. liteparse is imported lazily so registry discovery
and startup do not require it. Configuration via LITEPARSE_* settings;
optional dependency in requirements/ingestors/liteparse.txt.

Adds tests (opencontractserver/tests/test_doc_parser_liteparse.py) and docs
(docs/pipelines/liteparse_parser.md).

Closes #2074
@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review

This PR adds LiteParseParser, a fully-local PDF ingestion engine backed by LiteParse (PDFium). The implementation is well-structured — it follows the LlamaParseParser patterns correctly, all cross-file function signatures match, and the test suite is thorough. Four confirmed bugs and several cleanup items follow.


Confirmed Bugs

1. detect_headings and heading_size_ratio call-time kwarg overrides are silently dropped
opencontractserver/pipeline/parsers/liteparse_parser.py, lines 356–366

_parse_document_impl resolves 11 settings from all_kwargs but omits detect_headings and heading_size_ratio. Both _classify_heading_sizes (line 730) and the threshold calculation (line 756) read self.detect_headings / self.heading_size_ratio directly. The docstring explicitly lists both as supported overrides, so any call like parse_document(user_id=…, doc_id=…, detect_headings=False) produces headed output anyway — the kwarg is ignored.

Fix: extract both from all_kwargs in _parse_document_impl and thread them as parameters into _convert_result_to_opencontracts_classify_heading_sizes.


2. Four LITEPARSE_IMAGE_* env vars declared in Settings but never defined in base.py
opencontractserver/pipeline/parsers/liteparse_parser.py, lines 262–301 / config/settings/base.py

Settings declares env_var="LITEPARSE_IMAGE_FORMAT", LITEPARSE_IMAGE_QUALITY, LITEPARSE_MIN_IMAGE_WIDTH, LITEPARSE_MIN_IMAGE_HEIGHT — but none of these appear in base.py with an env.str()/env.int() definition (only LITEPARSE_IMAGE_MODE does). When migrate_pipeline_settings calls getattr(django_settings, "LITEPARSE_IMAGE_QUALITY", None) it gets None and silently uses the dataclass default. An operator who sets LITEPARSE_IMAGE_QUALITY=95 in .env gets no effect.

Fix: add the four missing env.* lines to base.py alongside the existing LITEPARSE_* block.


3. page_count is wrong when target_pages restricts parsing to a page subset
opencontractserver/pipeline/parsers/liteparse_parser.py, line 625 / opencontractserver/utils/pdf_token_extraction.py

When target_pages="3-5" is configured for a 10-page PDF, LiteParse returns 3 pages and the export dict is built with "page_count": 3. However, extract_pawls_tokens_from_pdf(pdf_bytes, page_dimensions) uses page_dimensions only for coordinate scaling, not as a page filter — it iterates enumerate(pdf.pages) for all 10 pages and returns a 10-entry pawls_pages list. BaseParser.save_parsed_data then unconditionally sets document.page_count = len(pawls_file_content) = 10, overwriting the export's page_count.

Fix: either pass a page filter to extract_pawls_tokens_from_pdf or derive page_count from len(pawls_pages) rather than len(liteparse_pages).


4. Degenerate-bbox guard is a no-op when an item falls exactly on the page edge
opencontractserver/pipeline/parsers/liteparse_parser.py, lines 811–819 (_bounds_from_item)

After clamping, if a text item's reported x equals page_width (malformed PDF or OCR artifact), both left and right are clamped to page_width. The minimum-size guard then executes right = min(left + 1, page_width) = min(page_width + 1, page_width) = page_width — unchanged, so the box remains zero-width. The same applies vertically at page_height. The resulting degenerate bbox passed to the STRtree spatial query finds no tokens, producing an annotation with an empty tokensJsons.

Fix: expand in the other direction when at the edge — e.g. right = page_width; left = max(0.0, page_width - 1) — or clamp after expanding rather than before.


Code Quality

5. __init__ copies all 16 Settings fields one-by-one — maintenance trap
liteparse_parser.py, lines 303–331

Every setting is defined in Settings and then explicitly re-assigned onto self. Adding a new setting requires edits in two places with no static check that they're in sync. (This pattern exists in LlamaParseParser too, so it's consistent, but still worth noting.) Accessing settings as self.settings.<field> throughout would eliminate the duplication.

6. DEFAULT_WIDTH = 612.0 / DEFAULT_HEIGHT = 792.0 are module-level magic numbers
liteparse_parser.py, lines 78–79

CLAUDE.md §4 ("No magic numbers") requires hardcoded values to live in opencontractserver/constants/. These US Letter point dimensions already appear in compact_pawls.py and docling_parser_rest.py; adding a third copy creates drift risk. They should be defined once in opencontractserver/constants/document_processing.py and imported everywhere.

7. liteparse>=0.1.0 is unbounded on an early-stage library
requirements/ingestors/liteparse.txt, line 3

The parser accesses 10+ attribute names (pages, page_num, text_items, font_size, ...) by string. The _attr() helper silently returns [] or None on any attribute rename — so a liteparse 1.0 that renames text_items to lines would silently produce 0 annotations per document with no exception raised. Consider pinning to a minor version (liteparse>=0.1.0,<0.2) until the library's API stabilises.


Overall this is solid work. Fixing bugs 1 and 2 are the highest-priority items before merge; 3 and 4 matter if target_pages or malformed PDFs are expected in production.

@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 98.56734% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ontractserver/pipeline/parsers/liteparse_parser.py 98.51% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

claude added 2 commits June 27, 2026 04:32
…e env vars

Address code-review findings on PR #2075:

- detect_headings / heading_size_ratio call-time kwargs were resolved in the
  docstring but dropped: _classify_heading_sizes read self.* directly. Thread
  both from _parse_document_impl through _convert_result_to_opencontracts into
  _classify_heading_sizes (falling back to instance settings when None) so the
  advertised per-call overrides actually take effect. Adds a regression test.
- Define the four LITEPARSE_IMAGE_* env vars (FORMAT/QUALITY/MIN_IMAGE_WIDTH/
  MIN_IMAGE_HEIGHT) in base.py so the Settings env_var declarations are honored
  by migrate_pipeline_settings instead of silently using dataclass defaults.
- Export page_count now uses len(pawls_file_content) (the value save_parsed_data
  persists) rather than the count of LiteParse-parsed pages, so the two agree
  when target_pages restricts parsing to a subset.
- _bounds_from_item: a text item pinned exactly to the page edge produced a
  zero-size box (the +1 guard clamped back to the edge); expand inward in that
  case so the bbox is always >=1pt. Adds a unit test.
- Move the fallback US-Letter page dimensions to
  constants/document_processing.py (DEFAULT_PDF_PAGE_WIDTH/HEIGHT) and import
  them, removing the module-level magic numbers (CLAUDE.md "no magic numbers").
Add tests exercising the parser's previously-uncovered fallback paths so the
codecov patch-coverage gate passes:

- token-extraction failure -> synthesized empty PAWLs pages
- image-extraction failure handled gracefully
- optional construct kwargs (ocr_server_url / target_pages / max_pages /
  password) forwarded to LiteParse only when set
- LiteParse.parse() exception -> returns None
- invalid page_num / width / height -> default dimensions
- full image metadata copied; out-of-range image page skipped
- non-numeric font_size / coordinates and inverted bbox in the helpers

JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for the review — addressed in 29c0254 (fixes) and 18aea12 (coverage). Disposition of each item:

Fixed

  • 1 (heading kwargs dropped)detect_headings / heading_size_ratio are now resolved in _parse_document_impl and threaded through _convert_result_to_opencontracts_classify_heading_sizes (falling back to instance settings when None). Added a regression test asserting a detect_headings=False call-time kwarg overrides an instance configured with detection on.
  • 2 (missing image env vars) — added LITEPARSE_IMAGE_FORMAT / LITEPARSE_IMAGE_QUALITY / LITEPARSE_MIN_IMAGE_WIDTH / LITEPARSE_MIN_IMAGE_HEIGHT to base.py so the env_var declarations are honored by migrate_pipeline_settings.
  • 3 (page_count vs target_pages) — export page_count now uses len(pawls_file_content), the value save_parsed_data persists, so the two agree.
  • 4 (degenerate bbox at page edge)_bounds_from_item now expands inward when an item sits exactly on the page edge; added a unit test.
  • 6 (magic numbers) — moved the fallback US-Letter dimensions to constants/document_processing.py (DEFAULT_PDF_PAGE_WIDTH/HEIGHT) and import them. (Left the pre-existing copies in docling_parser_rest.py / compact_pawls.py untouched as out of scope for this PR; noted in the constant's docstring.)

Not changed (intentional)

  • 5 (field-by-field copy in __init__) — kept, to match the established LlamaParseParser / DoclingParser pattern; self.settings can be None (it falls back to Settings()), so the explicit copy is deliberate rather than incidental.
  • 7 (unbounded liteparse>=0.1.0) — agree the _attr() helper degrades quietly on an upstream rename. I left the version unpinned for now because I can't confirm liteparse's current published version from here, and a wrong <0.2 ceiling would make it uninstallable. Flagging for a maintainer to set the right upper bound against the actual release.

Also added tests covering the token-/image-extraction failure fallbacks and invalid-page-metadata paths to bring patch coverage back over the gate.


Generated by Claude Code

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review: Add LiteParse local PDF ingestion engine (#2074)

This PR adds LiteParseParser — a fully-local PDFium-based PDF engine that augments LiteParse's line-level spatial output with word-level PAWLs tokens (pdfplumber), font-size-derived feature labels, and a parent-child heading hierarchy. The architecture mirrors LlamaParseParser closely, the test coverage is thorough (error paths, heading detection, image extraction, kwarg overrides), and the optional-import pattern correctly prevents startup failures when liteparse isn't installed. A few issues worth addressing before merge:


Bug — image_format passed without runtime validation; corrupt image storage on unsupported values

opencontractserver/pipeline/parsers/liteparse_parser.py ~line 698

image_format=cast(Literal["jpeg", "png"], self.image_format),

cast() is a mypy-only annotation — it performs zero runtime coercion. self.image_format is a plain str loaded from the database with no schema-level validation= callable. If an operator sets LITEPARSE_IMAGE_FORMAT=webp, the value passes through cast() unchecked into extract_images_from_pdf, which contains this branch in pdf_token_extraction.py:

if image_format == "jpeg":
    pil_image.save(img_bytes_io, format="JPEG", quality=jpeg_quality)
else:
    pil_image.save(img_bytes_io, format="PNG")   # "webp" falls here silently

The bytes are saved as PNG, but the token's "format": "webp" field and the storage path extension (.webp) reflect the unsupported value. Any consumer building a data:image/webp;base64,... URL from PNG-encoded bytes will get a broken image.

Fix: Clamp the value at use-time: image_format = self.image_format if self.image_format in ("jpeg", "png") else "jpeg", or add a validation callable to the Settings dataclass field.


Inconsistency — PARSER_KWARGS entry covers only 4 of 15 settings; 11 LITEPARSE_* env vars are dead letters

config/settings/base.py lines 1442–1447

"opencontractserver.pipeline.parsers.liteparse_parser.LiteParseParser": {
    "output_format": LITEPARSE_OUTPUT_FORMAT,
    "ocr_enabled": LITEPARSE_OCR_ENABLED,
    "detect_headings": LITEPARSE_DETECT_HEADINGS,
    "extract_images": LITEPARSE_EXTRACT_IMAGES,
},

Absent: LITEPARSE_DPI, LITEPARSE_NUM_WORKERS, LITEPARSE_TARGET_PAGES, LITEPARSE_MAX_PAGES, LITEPARSE_PASSWORD, LITEPARSE_OCR_LANGUAGE, LITEPARSE_OCR_SERVER_URL, LITEPARSE_IMAGE_MODE, LITEPARSE_HEADING_SIZE_RATIO, LITEPARSE_IMAGE_FORMAT, LITEPARSE_IMAGE_QUALITY, LITEPARSE_MIN_IMAGE_WIDTH, LITEPARSE_MIN_IMAGE_HEIGHT.

Compare LlamaParseParser's entry (lines ~1434–1441), which mirrors every LLAMAPARSE_* Django setting. For a deployment that hasn't run migrate_pipeline_settings (which seeds component_settings via the Settings dataclass schema), all 11 missing env vars have no effect — the parser falls back to dataclass defaults silently. Even when the management command has been run, the PARSER_KWARGS column is the canonical reference point for "what settings this parser supports" and having 11 holes is confusing to operators.

Fix: Add the remaining 11 keys, matching the LlamaParseParser pattern.


CLAUDE.md violation — "documents/" storage path prefix is a magic string duplicated in three parsers

opencontractserver/pipeline/parsers/liteparse_parser.py line 546
opencontractserver/pipeline/parsers/llamaparse_parser.py line 508
opencontractserver/pipeline/parsers/docling_parser_rest.py line 488

image_storage_path = f"documents/{document.pk}/images"

All three parsers hardcode the same "documents/" prefix (no shared constant). This is a CLAUDE.md "no magic numbers" violation — opencontractserver/constants/document_processing.py already had two new constants added in this PR (DEFAULT_PDF_PAGE_WIDTH, DEFAULT_PDF_PAGE_HEIGHT), so the pattern is established. A DOCUMENT_IMAGE_STORAGE_PREFIX = "documents" constant there would make it a single-source-of-truth and keep the three callers in sync. (The other two parsers predate this PR; fixing all three in one pass would be ideal.)


DRY — _create_annotation is a near-duplicate of LlamaParseParser._create_annotation

liteparse_parser.py line 867 vs llamaparse_parser.py line 1213

Both methods build an OpenContractsAnnotationPythonType with identical structure: page_annotation, content_modalities, annotation_json, and the fixed annotation_type=TOKEN_LABEL, structural=True. The only difference is that liteparse_parser.py adds a parent_id parameter (a functional superset). CLAUDE.md says "DRY — always architect code for maximal dryness." This utility could move to opencontractserver/pipeline/base/ as a module-level function and both parsers import it — the LlamaParser version would just pass parent_id=None.

Similarly, the fallback PAWLs page construction block (when extract_pawls_tokens_from_pdf raises) and the image token assembly block (PawlsTokenPythonType dict with optional metadata fields) are essentially identical across both parsers and would benefit from the same treatment.


Efficiency — _classify_item does two O(n) list scans per text item

liteparse_parser.py lines 818–819

if rounded in heading_sizes:         # O(n) membership
    level = heading_sizes.index(rounded)   # O(n) index scan

heading_sizes is a list; both operations are O(n) and they run for every text item in the document. The straightforward fix is to return a dict[float, int] (size → level) from _classify_heading_sizes alongside the sorted list, making _classify_item a single O(1) dict lookup:

level = heading_size_to_level.get(rounded)
if level is not None:
    label = LABEL_TITLE if level == 0 else LABEL_SECTION_HEADER
    return level, label
return None, LABEL_TEXT_BLOCK

For typical documents (2–4 heading levels) the absolute cost is small, but on a 500-page contract the current code does ~20 000 list scans instead of ~10 000 dict lookups.


Minor — sizes list in _classify_heading_sizes accumulates O(n_items) entries before Counter

liteparse_parser.py lines 781–789

sizes: list[float] = []
for page in pages:
    for item in ...:
        ...
        sizes.append(round(fs_f, 1))
counts = Counter(sizes)

sizes grows to one entry per text item across the entire document, then is discarded after Counter consumes it. Using Counter directly in the loop (counts: Counter[float] = Counter() / counts[round(fs_f, 1)] += 1) eliminates the intermediate list and keeps memory proportional to the number of distinct font sizes (typically 3–10) rather than the total number of text items.

…prefix

Address the second code-review pass on PR #2075:

- Clamp image_format to {jpeg,png} at use-time. cast() is a type-checker hint
  only; an operator value like "webp" would otherwise be saved as PNG bytes but
  tagged/extensioned "webp", producing broken data: URLs. Adds a test.
- Expand the LiteParseParser PARSER_KWARGS entry to every setting
  _parse_document_impl actually resolves per-call (dpi, num_workers,
  target_pages, max_pages, password, ocr_language, ocr_server_url, image_mode,
  heading_size_ratio). image_* stay out because they are read from resolved
  settings in _append_image_tokens, not per-call (mirrors LlamaParseParser).
- Add DOCUMENT_IMAGE_STORAGE_PREFIX to constants/document_processing.py and use
  it in the liteparse, llamaparse, and docling parsers (identical output),
  removing the triplicated "documents/" literal.
- _classify_item now takes a size->level dict for an O(1) lookup instead of two
  O(n) list scans per text item; the dict is built once per document.
- _classify_heading_sizes counts font sizes into a Counter incrementally rather
  than materialising an O(n_items) list first.

Declined (out of scope for this PR): extracting _create_annotation and the
fallback-page / image-token blocks into a shared base helper would refactor the
well-tested LlamaParseParser path; tracked as a follow-up.

JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for the second pass — addressed in d3304eb.

Fixed

  • image_format not runtime-validated — clamped to {jpeg,png} at use-time in _append_image_tokens (cast is type-checker-only, as you noted); a webp value now falls back to jpeg instead of writing PNG bytes tagged .webp. Added a test asserting the clamp.
  • PARSER_KWARGS holes — expanded the entry to every setting _parse_document_impl actually resolves per-call (dpi, num_workers, target_pages, max_pages, password, ocr_language, ocr_server_url, image_mode, heading_size_ratio, plus the original four). I intentionally left image_format/image_quality/min_image_* out: those are read from the resolved settings inside _append_image_tokens, not per-call, so listing them in PARSER_KWARGS would imply an override that doesn't fire — this matches LlamaParseParser, whose image settings are likewise absent from its PARSER_KWARGS entry (so the "mirrors every LLAMAPARSE_*" comparison isn't quite accurate). Operators still configure those via env → PipelineSettings.
  • "documents/" magic string ×3 — added DOCUMENT_IMAGE_STORAGE_PREFIX to constants/document_processing.py and used it in the liteparse, llamaparse, and docling parsers (byte-identical output).
  • O(n) scans in _classify_item_classify_heading_sizes now also yields a size → level map (built once in the converter) so per-item classification is an O(1) dict lookup.
  • intermediate sizes list_classify_heading_sizes accumulates into a Counter directly; memory is now proportional to distinct font sizes.

Declined (out of scope)

  • Extract _create_annotation / fallback-page / image-token blocks to pipeline/base — a real duplication, but doing it means refactoring the well-tested LlamaParseParser path (and its image/figure handling, which is a functional superset in places). That's a cross-cutting change I'd rather not fold into a "add an engine" PR; better as a dedicated follow-up so the shared helper can be reviewed against both parsers' full behavior.

All changes are black/isort/flake8-clean.


Generated by Claude Code

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review

This PR adds a solid, well-structured LiteParseParser with good test coverage — the heading-stack logic, bounding-box clamping, and image token offset arithmetic are all correct on the normal execution path. A few issues worth addressing before merge:


Bug — fallback pawls_pages misindexed in _append_image_tokens

opencontractserver/pipeline/parsers/liteparse_parser.py, lines 530–542 and 718/745

When extract_pawls_tokens_from_pdf fails, the fallback builds pawls_pages as a compacted list ordered by sorted(page_dimensions):

for page_idx in sorted(page_dimensions):   # e.g. [1, 2, 3, 4] for target_pages="2-5"
    pawls_pages.append(...)
# pawls_pages[0] → page 1, pawls_pages[1] → page 2, …

But _append_image_tokens then indexes it by absolute 0-based PDF page number (from extract_images_from_pdf, which always iterates the full PDF via range(num_pages)):

pawls_pages[page_idx]["tokens"].append(unified_token)   # page_idx = 0, 1, 2, …

With target_pages="2-5": page_idx=1 writes into slot 1, which holds the page-2 entry, not the page-1 entry. Images from unrelated pages land in wrong entries; images from the actual target pages are silently dropped (they fall within range(len(pawls_pages)) but map to wrong slots). The guard page_idx >= len(pawls_pages) doesn't catch this.

The bug only fires when pdfplumber token extraction raises AND target_pages is set to a non-zero-starting range — unusual but not impossible. The normal path (pdfplumber succeeds) is not affected because then pawls_pages has one entry per full-document page in natural order.

Fix: in _append_image_tokens, either filter raw_images_by_page to only the keys present in page_dimensions before appending, or build a {page_idx: list_position} map at the time the fallback list is constructed.


DRY — llamaparse_parser.py didn't adopt the new shared constants

opencontractserver/pipeline/parsers/llamaparse_parser.py, lines 425–426

This PR adds DEFAULT_PDF_PAGE_WIDTH = 612.0 and DEFAULT_PDF_PAGE_HEIGHT = 792.0 to opencontractserver/constants/document_processing.py, explicitly to centralise these magic numbers. But llamaparse_parser.py still defines its own local shadows:

DEFAULT_WIDTH = 612
DEFAULT_HEIGHT = 792

These should be replaced by imports of the new constants, which is exactly what the PR aimed to achieve. (compact_pawls.py and docling_parser_rest.py also still inline 612/792, as the new constant's own docstring acknowledges, but those are pre-existing.)


DRY — label strings duplicated across parsers

opencontractserver/pipeline/parsers/liteparse_parser.py, lines 76–79

LABEL_TITLE = "Title", LABEL_SECTION_HEADER = "Section Header", LABEL_TEXT_BLOCK = "Text Block", LABEL_IMAGE = "Image" are declared here, and the same string values appear in llamaparse_parser.py's ELEMENT_TYPE_MAPPING (lines 75–82). A shared constants module (opencontractserver/constants/annotations.py) or extension of the existing annotation constants would give these a single source of truth.


DRY — patch_parser_settings test helper defined twice

opencontractserver/tests/test_doc_parser_liteparse.py, line 154

patch_parser_settings is structurally identical to the same helper in test_doc_parser_llamaparse.py (line 74) — same pattern of building a mock settings object and copying attributes onto the parser instance. Extracting it to a shared test utility (e.g., opencontractserver/tests/utils/parser_test_helpers.py) would avoid a third copy when the next parser arrives.


Minor — redundant module-level aliases

opencontractserver/pipeline/parsers/liteparse_parser.py, lines 84–85

DEFAULT_WIDTH = DEFAULT_PDF_PAGE_WIDTH
DEFAULT_HEIGHT = DEFAULT_PDF_PAGE_HEIGHT

These are used internally but add an extra name for the same value. The imported constants from document_processing.py can be used directly at the seven call sites inside the file.

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review

This PR adds LiteParseParser — a third, fully-local PDF ingestion engine backed by LiteParse/PDFium. The architecture is solid: lazy import so the registry loads without the optional dependency, _attr() duck-typing for LiteParse dataclass/dict compatibility, document-wide font-size classification for heading detection, a cross-page heading stack, and fallback pdfplumber token extraction. Test coverage is thorough and mirrors the existing test_doc_parser_llamaparse.py patterns.

Three issues worth addressing before merge.


1. _append_image_tokens silently drops all images in the fallback path when target_pages is set

File: opencontractserver/pipeline/parsers/liteparse_parser.py ~line 1027

if page_idx >= len(pawls_pages) or not page_images:
    continue
token_offset = len(pawls_pages[page_idx].get("tokens", []))

pawls_pages is indexed by list position (0, 1, 2…). In the happy path, pdfplumber processes all pages, so pawls_pages[page_idx] aligns perfectly with the physical page index. But in the fallback path (when extract_pawls_tokens_from_pdf raises), pawls_pages is rebuilt from sorted(page_dimensions.keys()) — only the targeted pages:

# fallback build — compacted list of targeted pages only
pawls_pages = []
for page_idx in sorted(page_dimensions):   # e.g. [2, 4] for target_pages='3,5'
    pawls_pages.append({...})              # len == 2, not 10

extract_images_from_pdf independently walks the whole PDF and returns physical 0-based page indices (2 and 4 here). The guard 2 >= 2True → skip; 4 >= 2True → skip. All images are silently dropped with no log warning or error.

Fix: replace the list-length guard with a page-to-list-position map (analogous to page_index_by_pos already used for text items):

# build alongside pawls_pages in the fallback:
pawls_page_list_pos = {phys_idx: list_pos for list_pos, phys_idx in enumerate(sorted(page_dimensions))}

# then in _append_image_tokens:
list_pos = pawls_page_list_pos.get(page_idx)
if list_pos is None or not page_images:
    continue
token_offset = len(pawls_pages[list_pos].get("tokens", []))
pawls_pages[list_pos]["tokens"].append(unified_token)

2. LITEPARSE_DPI is not forwarded to extract_images_from_pdf

File: opencontractserver/pipeline/parsers/liteparse_parser.py ~line 694

dpi is resolved early (dpi = all_kwargs.get("dpi", self.dpi)) and passed to the LiteParse constructor for page rasterization. _append_image_tokens passes every other image-related setting (min_width, min_height, image_format, jpeg_quality, storage_path) but omits dpi:

raw_images_by_page = extract_images_from_pdf(
    pdf_bytes,
    min_width=self.min_image_width,
    min_height=self.min_image_height,
    image_format=image_format,
    jpeg_quality=self.image_quality,
    storage_path=image_storage_path,
    # dpi=self.dpi  ← missing
)

extract_images_from_pdf has dpi: int = DEFAULT_IMAGE_EXTRACTION_DPI (defaulting to 150). An operator who sets LITEPARSE_DPI=300 expecting higher-fidelity embedded images will get 150 DPI regardless, with no diagnostic. (LlamaParse has the same omission, so this may be intentional — but if so it's worth a comment explaining why the two DPI settings are deliberately decoupled.)


3. Annotation label strings belong in opencontractserver/constants/ (CLAUDE.md rule 4)

File: opencontractserver/pipeline/parsers/liteparse_parser.py lines 385–388

LABEL_TITLE = "Title"
LABEL_SECTION_HEADER = "Section Header"
LABEL_TEXT_BLOCK = "Text Block"
LABEL_IMAGE = "Image"

These same string values appear as inline literals in llamaparse_parser.py (the ELEMENT_TYPE_MAPPING dict at lines 72–79) and in docling_parser_rest.py. CLAUDE.md rule 4 ("no magic numbers — use constants files") requires shared values to live in opencontractserver/constants/. The PR correctly applies this rule to DEFAULT_PDF_PAGE_WIDTH, DEFAULT_PDF_PAGE_HEIGHT, and DOCUMENT_IMAGE_STORAGE_PREFIX — the label strings deserve the same treatment. A typo or rename in one parser diverges silently from the others, breaking any downstream consumer (corpus search, ML model, label filter) that matches by name.

Suggested fix: add these four constants to opencontractserver/constants/document_processing.py and import them into all three parsers.


Minor note

The test test_detect_headings_disabled_via_call_kwarg (and several others) correctly constructs a level_by_size dict rather than passing a raw heading_sizes list to _classify_item — good to confirm the test–implementation contract is consistent.

Address the third/fourth review passes on PR #2075:

- Fallback PAWLs page synthesis (used when pdfplumber token extraction raises)
  built a *compacted* list ordered by sorted(page_dimensions), breaking the
  "list position == absolute page index" invariant the success path guarantees.
  With target_pages restricting parsing to a non-zero-starting / non-contiguous
  range, _append_image_tokens (which indexes by absolute page number) then wrote
  images into the wrong page entries or dropped them. Rebuild the fallback over
  range(max_idx + 1), filling gaps, so positions always align. Adds a regression
  test (single parsed page at absolute index 1).
- llamaparse_parser.py now imports DEFAULT_PDF_PAGE_WIDTH/HEIGHT instead of its
  own local 612/792 literals, completing the centralization this PR introduced.
- Document why LiteParse's render DPI (self.dpi) is intentionally not forwarded
  to extract_images_from_pdf (it uses the pipeline-wide extraction DPI default,
  matching Docling/LlamaParse).

JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks — addressing the last two review passes together (they overlap) in cb3c0bb.

Fixed

  • Fallback page misindexing with target_pages (flagged in both passes) — confirmed real. The fallback page synthesis (used only when pdfplumber token extraction raises) compacted the list by sorted(page_dimensions), breaking the "list position == absolute page index" invariant the success path guarantees, so _append_image_tokens (which indexes by absolute page number) wrote images into wrong entries or dropped them once target_pages started at a non-zero / non-contiguous index. Fixed at the root: the fallback now builds over range(max_idx + 1), filling gaps, so positions always align in both paths (this also covers the '3,5' → keys {2,4} case). Added a regression test with a single parsed page at absolute index 1.
  • llamaparse local 612/792 — now imports DEFAULT_PDF_PAGE_WIDTH/HEIGHT, completing the centralization this PR introduced.

Clarified (no behavior change)

  • LITEPARSE_DPI not forwarded to extract_images_from_pdf — by design: LITEPARSE_DPI is LiteParse's own page-render DPI; embedded-image extraction uses the pipeline-wide IMAGE_EXTRACTION_DPI default, exactly as Docling and LlamaParse do (neither couples its parser DPI to image extraction). Added a comment documenting the decoupling rather than changing behavior and diverging from the other two parsers.

Declined (cross-cutting; better as a dedicated follow-up)

  • Centralize the structural-label strings ("Title", "Section Header", "Text Block", "Image") — these are already named module constants here, not inline literals. A true single-source-of-truth means defining the canonical structural-label vocabulary and migrating all parsers — LlamaParse's ELEMENT_TYPE_MAPPING has ~17 such labels and Docling's come from the microservice. Centralizing only the 4 that happen to overlap would leave LlamaParse's mapping half-literal/half-constant (worse, not better). Worth doing as its own change against all parsers.
  • Shared patch_parser_settings test helper — extracting it would mean editing the pre-existing test_doc_parser_llamaparse.py (different field set); per the repo's "don't touch old tests without permission" guidance I'd rather leave it and revisit when a third parser test lands.

All changes are black/isort/flake8-clean.


Generated by Claude Code

The mypy pre-commit hook flagged four calls that deliberately pass malformed
types (None / "bad" / "big") into the typed make_page/make_item test helpers to
exercise the parser's defensive coercion paths. Mark those four calls with
# type: ignore[arg-type]; the helpers stay strictly typed for the normal calls.
@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review

This PR adds LiteParseParser — a third, fully-local PDF ingestion engine using LiteParse (PDFium/Tesseract) alongside the existing Docling and LlamaParse engines. The implementation is well-structured: lazy import keeps the registry functional without the optional dependency, the _attr() helper makes unit testing easy without mocking real dataclasses, and the fallback PAWLs-page synthesis for the target_pages edge case is thoughtfully handled. The test suite is comprehensive. A few issues worth addressing before merge:


1. PDF password exposed in logs — config/settings/base.py:1457 + opencontractserver/pipeline/base/parser.py:87

PARSER_KWARGS includes "password": LITEPARSE_PASSWORD. BaseParser.parse_document (line 87) logs the full merged_kwargs dict before calling _parse_document_impl, so the unredacted password appears in Django logs at INFO level every time a password-protected PDF is parsed. The redact_sensitive_kwargs call inside _parse_document_impl is too late — the leak already happened.

Fix: Either remove password from PARSER_KWARGS entirely (read it only from self.password in _parse_document_impl, not as a call-time kwarg override), or apply redact_sensitive_kwargs in BaseParser.parse_document before the logger.info call — and do the same for any other parser that exposes secrets this way.


2. Orphaned image tokens when target_pages is set — liteparse_parser.py:720–761

extract_images_from_pdf is called with the full pdf_bytes and returns images for every page in the PDF. _append_image_tokens then writes those image tokens into pawls_pages for all pages that pass page_idx < len(pawls_pages). But the main annotation loop only iterates over LiteParse-returned pages (for pos, page in enumerate(pages)), which with target_pages="5-10" covers only pages 4–9.

Result: image tokens for pages 0–3 are silently written into pawls_pages with no corresponding Image annotation. The frontend receives a PAWLs export that has image tokens on pages it cannot annotate, which can confuse token-index consumers.

Fix: Pass the set of LiteParse-active page indices into _append_image_tokens and skip (or defer) image insertion for pages outside that set, or restrict the extract_images_from_pdf call to only the parsed page range.


3. body_size = min() can misclassify body text as headings — liteparse_parser.py:822

The comment explains the intent: "body text is the smallest of the common sizes." This handles the case where a heading size and body size tie in frequency. However, it silently misclassifies documents with frequent footnote or caption text at a smaller size than body. For example: 10 items at 8 pt (footnotes), 10 items at 12 pt (body), 5 items at 18 pt (headings). min() picks 8 pt as body; threshold = 9.6 pt; all 12 pt body lines become Section Header, and all 5 headings plus 10 body lines are misclassified.

A more robust heuristic: pick the size with the highest total character count (or weighted sum), not just the highest frequency. Alternatively, pick the most-frequent size among sizes that are not outliers on the large end.

This is admittedly heuristic territory, but the current min() choice has a non-obvious failure mode that the comment doesn't mention.


4. _build_image_token duplicated — liteparse_parser.py:742–761 vs llamaparse_parser.py:1186

LlamaParseParser already has _build_image_token as a static method (line 1186) that copies required fields and conditionally copies optional metadata into a PawlsTokenPythonType. LiteParseParser._append_image_tokens inlines the exact same 20-line pattern. There are now three copies (the third is in docling_parser_rest.py:541–566).

_build_image_token should be extracted to a shared utility (e.g., opencontractserver/utils/pdf_token_extraction.py) and called by all three parsers. As written, any new PawlsTokenPythonType field must be added in three places.


5. _create_annotation duplicated — liteparse_parser.py:896 vs llamaparse_parser.py:1219

Both parsers have a private _create_annotation method that builds OpenContractsAnnotationPythonType with the same structure (bounds, tokensJsons, rawText, annotationLabel, page, annotation_json, parent_id, annotation_type, structural, content_modalities). The only difference is that LiteParseParser adds a parent_id parameter while LlamaParseParser hardcodes None. This is another shared utility candidate — the same note about _build_image_token applies here.


6. Annotation label strings not shared — liteparse_parser.py:76–79

LABEL_TITLE = "Title", LABEL_SECTION_HEADER = "Section Header", LABEL_TEXT_BLOCK = "Text Block" are defined fresh here, but the same strings already appear as values in LlamaParseParser.ELEMENT_TYPE_MAPPING (lines 77–84). Per CLAUDE.md's DRY rule, these belong in a shared constants file (opencontractserver/constants/annotations.py or a new pipeline_labels.py) so a rename only needs one edit. Currently both parsers would need to be updated independently.


Findings ranked most-severe first. Items 1–3 are actionable bugs; items 4–6 are DRY/maintainability issues that are reasonable to address now while the new code is fresh.

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review: LiteParseParser (#2074)

This PR adds a clean third local PDF ingestion engine built on LiteParse. The overall structure — lazy import, settings dataclass, pdfplumber token mapping, shapely spatial index for bbox→token lookup, heading stack for parent-child hierarchy — follows the LlamaParse parser's patterns well. The test suite is thorough. A few issues below, ordered by severity.


1. Heading detection silently fails for heading-heavy documents [HIGH]

File: opencontractserver/pipeline/parsers/liteparse_parser.py (~line 329)

_classify_heading_sizes treats the most frequent font size as body text. If heading-level text outnumbers body text — e.g., a contract with 10 repeated exhibit-header lines at 16pt and only 5 body sentences at 11pt — then counts = {16.0: 10, 11.0: 5}, max_count = 10, body_size = 16.0, threshold = 19.2, and heading_sizes = []. Every line becomes a flat Text Block with no hierarchy — heading detection silently produces nothing.

The code comment acknowledges the tie case ("on short documents a heading size can tie body for frequency") and applies a min() tiebreak — but when headings genuinely outnumber body text (no tie), the tiebreak does not fire.

Common real-world triggers: slide decks, tables of contents, legal exhibits with repeated section labels. Suggest capping body-size detection at the smallest highly-frequent size rather than the absolute modal, or using a frequency-weighted approach.


2. Orphaned image tokens on mid-loop exception in _append_image_tokens [MEDIUM]

File: opencontractserver/pipeline/parsers/liteparse_parser.py (~line 763)

pawls_pages[page_idx]["tokens"].append(unified_token) mutates the list in-place inside the try block. If extract_images_from_pdf succeeds for all pages but an exception is raised mid-loop on a later page (e.g., a malformed img_data raises KeyError on img_data["x"]), the except handler resets images_by_page = {} and image_token_offsets = {} but cannot roll back already-appended tokens in pawls_pages. The export then contains is_image=True tokens in the PAWLs array with no corresponding Image annotation.

Consider building a local token buffer per page and writing to pawls_pages only after the full page succeeds, or moving the exception boundary inside the per-page loop so partial pages can be skipped individually.


3. Image tokens written but no Image annotations for pages outside target_pages [MEDIUM]

File: opencontractserver/pipeline/parsers/liteparse_parser.py (~line 578)

When target_pages="3" is set, LiteParse returns only page 3 (page_idx=2). But extract_images_from_pdf(pdf_bytes) processes the full PDF (pdfplumber does not respect target_pages), so image tokens are appended to pawls_pages[0], pawls_pages[1], etc. However, the annotation loop only iterates LiteParse's returned pages, so Image annotations are never created for pages 0 and 1. Result: orphaned is_image=True tokens in the PAWLs array with no annotation pointing to them.

This only occurs when target_pages restricts parsing. Worth noting in a comment, or restricting extract_images_from_pdf to only the pages LiteParse processed.


4. _SELECTED_OFFICE_PARSER hardcodes the string "liteparse" [MEDIUM — design]

File: config/settings/base.py (~line 1322)

_SELECTED_OFFICE_PARSER = (
    _PDF_PARSER_MAP["docling"]
    if PDF_PARSER.lower() == "liteparse"
    else _SELECTED_PDF_PARSER
)

This works correctly today but will silently break the moment a second PDF-only parser is added — its DOCX/PPTX entries would fall through to _SELECTED_PDF_PARSER (the new parser) instead of DoclingParser.

The cleaner approach is to resolve the selected parser class and check its supported_file_types — each parser already declares this. That guard would be self-maintaining as new parsers are added.


5. _create_annotation is duplicated; the copy in LlamaParseParser silently drops parent_id [LOW]

Files: opencontractserver/pipeline/parsers/liteparse_parser.py (~line 896), llamaparse_parser.py (~line 1219)

Both classes define structurally identical _create_annotation methods. The only difference: LiteParseParser accepts and passes through parent_id, while LlamaParseParser hardcodes "parent_id": None. This means LlamaParse-parsed documents can never have parent-child hierarchy — probably intentional given LlamaParse's flat output — but the gap is masked by the duplication. A shared helper on BaseParser (with parent_id=None default) would surface the LlamaParse limitation explicitly.


6. image_format, image_quality, min_image_* absent from PARSER_KWARGS [LOW]

File: config/settings/base.py (~line 1439)

All other LiteParse settings flow through PARSER_KWARGS and can be overridden per-call via **all_kwargs. These four image-extraction settings are intentionally excluded (they are read from self.* at __init__ time from PipelineSettings). The PR comment explains this above the block, but an operator reading PARSER_KWARGS alone would not see it. A short inline comment noting these are PipelineSettings-only with no per-call override would prevent confusion.


7. sys.modules.setdefault creates a dead else branch in test cleanup [LOW]

File: opencontractserver/tests/test_doc_parser_liteparse.py (~line 1044)

Because sys.modules.setdefault("liteparse", _mock_liteparse) runs at module import time, sys.modules.get("liteparse") inside test_import_error_returns_none always returns a truthy object. The else: sys.modules.pop("liteparse", None) branch is dead code — structurally unreachable. No actual test-isolation problem, just noise to remove.


Minor: magic number defaults repeated across three locations [LOW]

Values like dpi=150, image_quality=85, min_image_width=50, num_workers=4, heading_size_ratio=1.2 appear as literals in (a) the Settings dataclass defaults, (b) config/settings/base.py env.int(…, default=…) calls, and (c) create_mock_liteparse_settings() in the test helper. Per CLAUDE.md ("no magic numbers — use constants files"), these belong in opencontractserver/constants/. Also: DEFAULT_IMAGE_EXTRACTION_DPI = 150 already exists in opencontractserver/utils/pdf_token_extraction.py — the liteparse parser re-declares the same value instead of importing it.

…hardening

Address the fifth/sixth review passes on PR #2075:

- SECURITY: BaseParser.parse_document logged the merged kwargs unredacted at
  INFO before the per-parser impl could redact, leaking SECRET settings (the new
  LiteParse PDF password, and pre-existing LlamaParse api_key) into logs. Redact
  via redact_sensitive_kwargs in the base log line — fixes every parser.
- Heading detection now weights font sizes by CHARACTER MASS, not line
  frequency. Body prose dominates a document's characters, so this correctly
  identifies body text even when heading-style lines outnumber it (slide decks,
  TOCs, repeated exhibit labels) or when small footnotes are as frequent as body
  — cases the previous most-frequent + min() heuristic misclassified. Updated
  the hierarchy test to realistic line lengths and added a unit test for the
  heading-heavy/footnote scenario.
- _append_image_tokens skips image dicts missing required keys, so a malformed
  entry can't raise mid-loop and strand partial image tokens the except block
  can't roll back (the append mutates pawls_pages in place).
- Office-format fallback now keys off a named _PDF_ONLY_PARSER_NAMES set instead
  of an inline "liteparse" string, so adding another PDF-only engine stays a
  one-line change rather than a silent DOCX-routing bug.
- Documented that, like the word-token pass, image extraction spans the whole
  PDF; with target_pages the full token layer is present but only parsed pages
  are annotated (harmless — per-page token indices).
- Removed a dead else branch in the import-error test cleanup.

JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks — addressing the last two review rounds together (they overlap) in e9c8eee.

Fixed

  • 🔒 PDF password / API key leaking into logs (good catch) — BaseParser.parse_document logged the merged kwargs unredacted at INFO before the per-parser impl could redact, exposing SECRET settings (the new LiteParse password, and the pre-existing LlamaParse api_key). Now redacted via redact_sensitive_kwargs in the base log line, which fixes it for every parser at once.
  • Heading detection on heading-heavy / footnote-heavy docs (raised in both rounds) — switched body-size detection from line frequency to character mass. Body prose dominates a document's characters, so this correctly picks body even when heading-style lines outnumber it (slide decks, TOCs, repeated exhibit labels) or when small footnotes are as frequent as body — the failure modes of the old most-frequent + min() tiebreak. Updated the hierarchy test to realistic line lengths and added a unit test for the heading-heavy/footnote case.
  • Image-token exception safety_append_image_tokens now skips image dicts missing required keys, so a malformed entry can't raise mid-loop and strand partial tokens the except can't roll back (the append mutates pawls_pages in place).
  • Office-format fallback future-proofing — replaced the inline == "liteparse" check with a named _PDF_ONLY_PARSER_NAMES set, so adding another PDF-only engine is a one-line change rather than a silent DOCX-misrouting bug. (Introspecting each parser's supported_file_types here isn't viable — settings load before the app registry/models are importable.)
  • Dead else branch in the import-error test cleanup — removed.

Clarified, no behavior change

  • Image tokens on pages outside target_pages — documented in a comment: like the word-token pass, image extraction spans the whole PDF, so with target_pages the full token layer (text and image) is present but only parsed pages are annotated. Unreferenced tokens are harmless (token indices are per-page, so nothing shifts). Restricting images but not text would make the two inconsistent.
  • image_* absent from PARSER_KWARGS — already has an explanatory comment above that block (added in an earlier commit): those are read from resolved settings in _append_image_tokens, not per-call.

Declined (dedicated follow-up)

  • Extract _create_annotation / _build_image_token to a shared base/util — real duplication, but unifying them correctly means reconciling per-parser variants (e.g. Docling's image_type="cropped" default, LlamaParse's figure handling) and re-running all three engines' suites. Better as a focused refactor PR than as collateral in a feature PR.
  • Centralize structural-label strings and the dataclass/env/test default trio — the labels need a canonical cross-parser vocabulary (LlamaParse's mapping has ~17), and the default-value trio is the intended settings-schema shape (dataclass default is canonical; env/test mirror it) shared by every parser. Both are cross-cutting and out of scope here.

All changes black/isort/flake8-clean.


Generated by Claude Code

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review

This PR adds a well-structured LiteParseParser — a fully-local, PDFium-based PDF ingestion engine — alongside a comprehensive test suite, settings plumbing, and documentation. The approach (lazy import, heuristic heading detection, pdfplumber for word-level tokens) follows the existing LlamaParseParser patterns and integrates cleanly with the pipeline registry. A few issues worth addressing before merge:


Bug — Image token tokenIndex is wrong when any image dict is skipped

opencontractserver/pipeline/parsers/liteparse_parser.py line 778 vs. line 653

_append_image_tokens skips malformed image dicts that are missing required keys (line 753–754) but still stores the full raw list (including the skipped entries) in images_by_page[page_idx] at line 778:

# Line 753-754: guard that skips malformed images
if not all(k in img_data for k in ("x", "y", "width", "height")):
    continue
pawls_pages[page_idx]["tokens"].append(unified_token)  # only appended if well-formed
...
images_by_page[page_idx] = page_images  # ALL images, including skipped ones

Then in _convert_result_to_opencontracts, the annotation loop enumerates that full list and computes token_offset + img_idx:

for img_idx, img_data in enumerate(page_images):  # img_idx counts skipped ones too
    ...
    token_refs=[{"pageIndex": page_idx, "tokenIndex": token_offset + img_idx}]

If image 0 is skipped, image 1 lands at pawls_pages slot token_offset + 0, but its annotation says tokenIndex = token_offset + 1. Every image after the first skipped one on a given page gets a tokenIndex that points at the wrong token slot — the frontend would highlight the wrong region (or nothing if the index is out of range).

Fix: Store only the successfully-appended images in images_by_page, or maintain a separate counter for appended images to use in place of img_idx.


docling_parser_rest.py still inlines 612/792 — the constants file acknowledges this gap

opencontractserver/constants/document_processing.py line 42-43

The PR adds DEFAULT_PDF_PAGE_WIDTH = 612.0 and DEFAULT_PDF_PAGE_HEIGHT = 792.0 to the constants module and even adds a comment saying those constants exist to replace the inline literals that "docling_parser_rest.py ... still inline". But docling_parser_rest.py was only partially updated: it got DOCUMENT_IMAGE_STORAGE_PREFIX but not the page dimensions. Three inline literals remain:

# docling_parser_rest.py
width = float(page_info.get("width", 612))   # line 636
height = float(page_info.get("height", 792)) # line 637
page_width, page_height = page_dims.get(page_idx, (612, 792))  # line 713

CLAUDE.md: "No magic numbers — we have constants files… Use them for any hardcoded values." The constants module's own comment naming docling_parser_rest.py signals this was intended to be fixed in this PR.


DRY — _build_image_token already exists in LlamaParseParser

opencontractserver/pipeline/parsers/liteparse_parser.py lines 758–777

LlamaParseParser has a dedicated _build_image_token() static method (around line 1186 of llamaparse_parser.py) that assembles the unified PawlsTokenPythonType dict with the same required-fields-first, optional-metadata-guarded pattern. LiteParseParser._append_image_tokens() re-implements the identical block inline instead of calling it. This method is parser-agnostic (it only touches PawlsTokenPythonType); it belongs in BaseParser or a shared utility so a new optional field in PawlsTokenPythonType doesn't require updates in two places.


DRY — _create_annotation() duplicated between LiteParseParser and LlamaParseParser

opencontractserver/pipeline/parsers/liteparse_parser.py lines 925–973; llamaparse_parser.py lines ~1219–1291

Both parsers define a nearly identical _create_annotation() instance method. The only structural difference is that the LiteParse version adds a parent_id parameter for hierarchy. This is a natural candidate for BaseParser — either as a shared method or a mixin — especially since the annotation dict shape is part of the pipeline contract.


PLAUSIBLE — Micro-font body text sets heading threshold near zero

opencontractserver/pipeline/parsers/liteparse_parser.py lines 847–851

body_size = min(size for size, w in weights.items() if w == max_weight)
threshold = body_size * heading_size_ratio

The guard at line 836 excludes fs_f <= 0, but permits arbitrarily small positive sizes (e.g. 0.1pt from a vector watermark). If such a micro-font size carries the most characters (possible in a document where LiteParse extracts only OCR-confirmed text while a vector watermark at 0.1pt has many letters), threshold = 0.1 * 1.2 = 0.12 — every legible line becomes a heading. The heading stack would grow unboundedly across the document. A minimum body-size floor (e.g. max(body_size, 1.0)) before computing the threshold would prevent this.


Minor — test_doc_parser_liteparse.py duplicates create_mock_token_extraction_result

opencontractserver/tests/test_doc_parser_liteparse.py lines 83–123; test_doc_parser_llamaparse.py lines 94–145

Both test files define a create_mock_token_extraction_result function producing the same 6-tuple structure. If extract_pawls_tokens_from_pdf's return signature changes, both helpers must be updated independently. Consider extracting to a shared test utility (e.g. opencontractserver/tests/utils/parser_test_utils.py).

…e constants

Address the seventh review pass on PR #2075:

- BUG (regression from the prior commit's malformed-image guard): _append_image_tokens
  skipped malformed images when appending tokens but still stored the full raw
  list in images_by_page, so the annotation loop's token_offset + img_idx pointed
  past the real slot for every image after a skipped one. Store only the
  successfully-appended images (in append order) so annotation token indices stay
  aligned. Adds a regression test (malformed first image must not shift the
  second image's tokenIndex).
- Exclude sub-point font sizes (< MIN_CONTENT_FONT_SIZE = 1.0) from heading-size
  detection so a tiny vector watermark carrying many characters can't be picked
  as body text and drag the heading threshold toward zero (turning every line
  into a heading). Adds a unit test.
- docling_parser_rest.py now imports DEFAULT_PDF_PAGE_WIDTH/HEIGHT instead of its
  three inline 612/792 literals, completing the centralization the constant's
  docstring described; docstring updated to match.

Declined (tracked as a dedicated follow-up): extracting _build_image_token /
_create_annotation to a shared base/util and the shared test helpers — those
refactor the well-tested LlamaParse/Docling paths and belong in their own PR.

JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Good catches — addressed in 63dc0cc.

Fixed

  • Image tokenIndex misalignment when an image is skipped — correct, and it was a regression from the malformed-image guard I added in the previous commit: I skipped malformed images when appending tokens but still stored the full raw list in images_by_page, so token_offset + img_idx pointed past the real slot for every image after a skipped one. Now I store only the successfully-appended images (in append order), so annotation indices stay aligned. Added a regression test (malformed first image must not shift the second image's tokenIndex).
  • Micro-font watermark — excluded sub-point sizes (< MIN_CONTENT_FONT_SIZE = 1.0) from heading-size detection so a 0.1pt watermark with many characters can't be picked as body and drag the threshold to ~0. (Note: the suggested max(body_size, 1.0) floor on the threshold wouldn't fix it — real 11pt text would still exceed 1.2 and be classed a heading; filtering the watermark out of body candidacy is the actual fix.) Added a unit test.
  • docling 612/792 — now imports DEFAULT_PDF_PAGE_WIDTH/HEIGHT for all three inline literals, completing what the constant's docstring described; docstring updated.

Declined (dedicated follow-up)

  • Extracting _build_image_token / _create_annotation to a shared base/util, and the shared create_mock_token_extraction_result / patch_parser_settings test helpers — all real duplication, but each refactors the well-tested LlamaParse/Docling paths (or their test files), so they belong in a focused refactor PR that can reconcile the per-parser variants and re-run all three engines' suites, rather than as collateral churn here.

All changes black/isort/flake8-clean.


Generated by Claude Code

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review — LiteParseParser (#2074)

This PR adds a solid third local PDF ingestion engine. The coordinate-system rationale is well-documented, the lazy import pattern is correct, and redact_sensitive_kwargs is a good catch. Six findings below, two confirmed and four plausible.


1. Conversion failures blamed on LiteParse in logs (CONFIRMED)

opencontractserver/pipeline/parsers/liteparse_parser.py, outer try block in _parse_document_impl

_convert_result_to_opencontracts(...) is called inside the same try/except Exception block that guards the from liteparse import LiteParse import and parser.parse() call. Any exception raised during conversion — pdfplumber OOM on a 500-page PDF, a shapely TopologicalError in find_tokens_in_bbox, a NumPy allocation failure — will be caught and logged as "LiteParse parsing failed". Operators cannot tell from the log whether LiteParse itself failed or whether the subsequent token/annotation conversion did, making triaging indefinitely retrying jobs painful.

Fix: move the _convert_result_to_opencontracts call outside the try block (or wrap it in its own try/except with a distinct log message).


2. Double INFO log per parse invocation (CONFIRMED)

opencontractserver/pipeline/parsers/liteparse_parser.py ~line 725-729

This PR adds redact_sensitive_kwargs(merged_kwargs) logging to BaseParser.parse_document() before dispatching to _parse_document_impl — correct. But LiteParseParser._parse_document_impl opens with its own second logger.info(... redact_sensitive_kwargs(all_kwargs) ...). Every parse emits two INFO lines with identical kwargs. No other parser (Docling, LlamaParse) has a per-impl entry log now that the base class handles it.

Fix: remove the from opencontractserver.utils.logging import redact_sensitive_kwargs import and the second log call inside _parse_document_impl.


3. _PDF_ONLY_PARSER_NAMES is a manually maintained set with no enforcement (PLAUSIBLE)

config/settings/base.py, _PDF_ONLY_PARSER_NAMES = {"liteparse"}

The comment acknowledges that supported_file_types cannot be introspected at settings-load time, so the PDF-only set is maintained manually. This is fine today with a single entry, but if a future developer adds a second PDF-only engine to _PDF_PARSER_MAP and forgets to add its key here, _SELECTED_OFFICE_PARSER silently equals the PDF-only parser, and every DOCX/PPTX upload gets routed to an engine that cannot read it — with no runtime warning.

Suggestion: add a comment that this set must be kept in sync with the parsers' supported_file_types, and add a test (or a manage.py check) that, for every key in _PDF_PARSER_MAP, imports the parser class and verifies that if FileTypeEnum.PDF is the only supported type, the key is in _PDF_ONLY_PARSER_NAMES.


4. Silent image loss when pdfplumber page count is shorter than expected (PLAUSIBLE)

opencontractserver/pipeline/parsers/liteparse_parser.py, _append_image_tokens ~line 1113

if page_idx >= len(pawls_pages):
    logger.warning(...)
    continue

The warning is correct, but this path is reachable in a real failure mode: if extract_pawls_tokens_from_pdf internally catches per-page exceptions and silently omits the failed page (it does — pdfplumber's page loop can raise on individual pages), pawls_pages may have fewer entries than the PDF's total page count. When extract_images_from_pdf then iterates over all pages and finds images on the omitted page, those images are silently dropped. The warning does fire, but no error is raised and the final export gives no indication that images are missing.

This is a pre-existing risk in the pdfplumber utility functions, not new to this PR — but since LiteParseParser's fallback path (synthetic empty pages) depends on this invariant too, a comment noting the assumption would help future readers.


5. Magic literals 612 / 792 in the test helper (CLAUDE.md violation)

opencontractserver/tests/test_doc_parser_liteparse.py, make_page helper and call sites

The test file's make_page helper constructs page dicts with hardcoded width=612, height=792, instead of importing DEFAULT_PDF_PAGE_WIDTH / DEFAULT_PDF_PAGE_HEIGHT from opencontractserver.constants.document_processing — the exact constants this PR introduces. CLAUDE.md: "No magic numbers — we have constants files". If the default page-size constant is ever changed, the test assertions silently diverge from production behaviour.


6. target_pages restricts LiteParse but not pdfplumber extraction (PLAUSIBLE)

opencontractserver/pipeline/parsers/liteparse_parser.py ~lines 938-939

When an operator sets target_pages="1-5" on a 200-page PDF for performance reasons, LiteParse correctly parses only those 5 pages. But extract_pawls_tokens_from_pdf and extract_images_from_pdf are called with the full pdf_bytes and no page-range filter — they open and process all 200 pages. The comment at line ~561 notes this is "harmless," but it means target_pages delivers no latency or memory benefit from the pdfplumber passes, which dominate cost on large documents. At minimum, a warning should clarify that target_pages scopes LiteParse output only.


Minor: PARSER_KWARGS omits image_format/image_quality/min_image_* (documented, noted for visibility)

The PR comment explains these are instance-only settings. This is fine but creates an asymmetric API where extract_images can be overridden per-call but image format/quality cannot. This is consistent with LlamaParseParser's behaviour so it is not a regression, but a short in-code comment at the PARSER_KWARGS block explaining why these four are absent would help future maintainers who wonder if they were forgotten.

…dup log

Address the eighth review pass on PR #2075:

- _convert_result_to_opencontracts now runs in its own try/except, separate from
  the LiteParse import/parse step, so a failure in the pdfplumber token pass,
  shapely spatial query, or annotation assembly is logged as a conversion error
  rather than mislabelled "LiteParse parsing failed" (very different triage).
  Still returns None on failure. Adds a test for the conversion-failure path.
- Removed the per-impl INFO log (and its now-unused redact import): the base
  BaseParser.parse_document already logs the redacted merged kwargs, so the impl
  log was a redundant second line per parse.
- Commented the image-page guard to note an out-of-range page is normally a
  page past the target_pages range, or (rarely) a page pdfplumber omitted.

Declined: a test enforcing _PDF_ONLY_PARSER_NAMES vs supported_file_types (those
settings vars aren't exposed via django.conf.settings, and the in-code comment
already flags the sync requirement) and converting test-fixture 612/792 to
constants (they're concrete fixture dimensions, not references to the prod
default).

JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Addressed in 57d1725:

  • Conversion vs parse failures_convert_result_to_opencontracts now runs in its own try/except so a pdfplumber/shapely/assembly failure is logged distinctly instead of as "LiteParse parsing failed". Added a test for the conversion-failure path.
  • Double INFO log — removed the redundant per-impl log (and its unused import); BaseParser.parse_document already logs the redacted merged kwargs.
  • Commented the image-page out-of-range guard.

Declined (rationale in the commit): a test enforcing _PDF_ONLY_PARSER_NAMES vs supported_file_types (those private settings vars aren't exposed via django.conf.settings; the in-code comment already flags the sync requirement), and converting test-fixture 612/792 to constants (concrete fixture dimensions, not references to the production default).


Generated by Claude Code

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review

This PR adds a solid third PDF ingestion engine — LiteParseParser — with good test coverage, lazy liteparse import for registry safety, and well-documented heuristics for heading detection and token mapping. The settings schema, fallback page-synthesis logic, and image token indexing are thoughtfully engineered. A few actionable issues below, ranked by severity.


1. Image annotation bounds are not clamped — liteparse_parser.py ~line 653

Text-line annotations pass through _bounds_from_item, which clamps every coordinate to [0, page_width] × [0, page_height] and guarantees a ≥1pt box. Image annotations at lines 653–662 read coordinates directly from img_data with float(img_data.get("x", 0)) and no clamping:

image_bounds: BoundingBoxPythonType = {
    "left": img_x,
    "top": img_y,
    "right": img_x + img_w,
    "bottom": img_y + img_h,
}

If pdfplumber returns an image whose stream bounding box slightly exceeds the page (e.g. x=610, width=50 on a 612pt page → right=660), the annotation is stored with out-of-bounds coordinates. The frontend renderer expects all annotation coordinates within the page and may mis-render or throw. Fix: route image bounds through _bounds_from_item, or inline the same clamp.


2. LlamaParseParser inline image loop is missing the malformed-image guard — llamaparse_parser.py ~line 542

The new LiteParseParser._append_image_tokens explicitly guards each image dict before accessing required keys:

if not all(k in img_data for k in ("x", "y", "width", "height")):
    continue

…and the PR even includes a regression test for this (test_skipped_malformed_image_keeps_token_indices_aligned). The pre-existing LlamaParseParser loop at line 542–566 uses direct dict access (img_data["x"]) with no such guard. A single malformed image dict causes a KeyError, caught by the outer except at line 570, which resets images_by_page = {} and silently drops all extracted images for the entire document. Since the PR introduces the correct pattern and tests it, it would be worth backporting the guard to the LlamaParse loop in the same change.


3. _create_annotation is duplicated in both parsers; LlamaParse's copy hardcodes parent_id=Nonellamaparse_parser.py:1282

LiteParseParser._create_annotation (liteparse_parser.py:955–1003) and LlamaParseParser._create_annotation (llamaparse_parser.py:1219–1291) are ~90% identical. The only functional difference is that LlamaParse's copy hardcodes "parent_id": None rather than accepting it as a parameter. This means LlamaParse can never emit the parent-child hierarchy that the base layer (import_annotations self-FK + build_subtree_groups_for_document) already supports. If this method were moved to BaseParser with parent_id=None as a default, both parsers would share it and LlamaParse would get hierarchy support for free. At minimum, the duplication means a future field addition must be applied in both places.


4. _append_image_tokens reinvents LlamaParseParser._build_image_token inline — liteparse_parser.py:785–804

LlamaParseParser already has a static _build_image_token method (llamaparse_parser.py:1186–1217) that does exactly what the inline block in _append_image_tokens does: build a PawlsTokenPythonType with 7 required fields and 5 conditional metadata fields. The PR duplicates this logic rather than calling the existing helper (or moving it to BaseParser / a utility function). The two copies are already slightly inconsistent — the LlamaParse inline version at line 544 uses img_data["x"] (direct access), while the LiteParse version uses img_data.get("x", 0) (safe access). Any future schema change to the image token dict (e.g. a new metadata key) must be applied in two places.


5. Open-ended liteparse>=0.1.0 version constraint on a pre-stable library — requirements/ingestors/liteparse.txt:3

liteparse is at v0.1.x, a pre-stable release where semver permits breaking API changes between minor versions. The uncapped constraint means a pip install --upgrade can silently install 0.2.0 that renames ParseResult.pages or changes TextItem field names. Because the parser uses _attr(result, "pages") (attribute-access), a rename would trigger the guard at line 426 ("LiteParse returned no pages") — looking like a document-specific parse failure rather than a library API change, which is hard to diagnose in production. Consider pinning to liteparse>=0.1.0,<0.2.0 (or a tested upper bound) until the library reaches a stable 1.0.


6. MIN_CONTENT_FONT_SIZE belongs in the shared constants module — liteparse_parser.py:92

The PR itself introduces DEFAULT_PDF_PAGE_WIDTH/HEIGHT and DOCUMENT_IMAGE_STORAGE_PREFIX in opencontractserver/constants/document_processing.py with the note that literals should live in one place (CLAUDE.md "no magic numbers"). MIN_CONTENT_FONT_SIZE = 1.0 (the floor below which a font size is treated as a non-content watermark) lives in the parser file instead. Minor inconsistency with the precedent the PR itself establishes, and the same threshold could be relevant to future parsers.


Non-blocking notes

  • The character-mass weighting in _classify_heading_sizes (body = size with the most characters, not the most lines) is a good design choice for slide-deck / ToC-heavy documents, and the test coverage for it is thorough.
  • The separate try/except blocks for parse vs. conversion (_parse_document_impl lines 426–456 vs. 459–467) with distinct log messages is a nice triage aid.
  • The fallback PAWLs page synthesis correctly fills gaps for non-zero-starting target_pages ranges (verified by test_fallback_pages_keep_absolute_index_for_images).

Address the ninth review pass on PR #2075:

- Image annotations now route their bounds through _bounds_from_item, so a
  pdfplumber image whose stream bbox bleeds past the page edge is clamped to the
  page (and gets the >=1pt guarantee) exactly like text-line annotations, instead
  of storing out-of-page coordinates. Also removes the duplicated bounds-building.
- Corrected the liteparse version pin. The placeholder >=0.1.0 was wrong: the
  parser is written against liteparse's 2.x Python API (LiteParse /
  ParseResult.pages / ParsedPage.text_items / TextItem fields), and the latest
  published release is 2.2.1 (the project shipped breaking 0.x -> 1.x -> 2.x
  bumps). Pin to >=2.2.1,<3.0.0 so a >=0.1.0 floor can't resolve to an
  incompatible 0.x/1.x and an --upgrade can't pull a breaking 3.x. (The review's
  suggested <0.2.0 cap was based on stale version info and would be
  uninstallable.)

JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Addressed in de867a5:

  • Image annotation bounds not clamped (real bug) — image annotations now build bounds via _bounds_from_item, so an image whose pdfplumber stream bbox bleeds past the page edge is clamped to the page (same as text lines) instead of storing out-of-page coordinates. Also dedups the bounds-building.
  • Version pin⚠️ correcting the record: liteparse's latest published release is 2.2.1, not 0.1.x (it shipped breaking 0.x → 1.x → 2.x bumps). The suggested <0.2.0 cap would be uninstallable. More to the point, my own placeholder >=0.1.0 was wrong — this parser targets the 2.x Python API (LiteParse / ParseResult.pages / ParsedPage.text_items / TextItem fields), so I've pinned >=2.2.1,<3.0.0.

Declined (unchanged rationale — these refactor the well-tested LlamaParse/Docling paths or are parser-specific): backporting the malformed-image guard to LlamaParse, extracting _create_annotation / _build_image_token to a shared base, and moving MIN_CONTENT_FONT_SIZE (a liteparse-only heading-heuristic constant, not a cross-parser value) to the shared module. The duplications are tracked for a dedicated follow-up PR.


Generated by Claude Code

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review — LiteParse PDF Ingestion Engine (#2074)

This PR adds LiteParseParser as a third, fully-local PDF engine built on LiteParse/PDFium. The architecture is solid — lazy import for optional dependency, a well-structured Settings dataclass, sensible fallback for DOCX/PPTX, shared constants extraction, and password redaction in the base-class log line. The test suite is comprehensive. Below are actionable issues found.


Bug — Blank text items skew body-size detection, can silently disable all heading hierarchy

File: opencontractserver/pipeline/parsers/liteparse_parser.py, _classify_heading_sizes

_classify_heading_sizes gives every text item weight max(len(text.strip()), 1), meaning blank/whitespace-only lines contribute weight 1 per item. The annotation loop later skips those same lines (if not item_text: continue). These two passes operate on different populations.

Concrete failure: a PDF with 500 blank items at 14 pt (e.g. list bullets with no text, form fields, or repeated section-label placeholders) accumulates 500 weight at 14 pt; actual 12 pt body prose totals 300 chars. body_size is picked as 14.0, threshold = 14.0 × 1.2 = 16.8, no size exceeds the threshold, heading_sizes = [], and every annotation gets LABEL_TEXT_BLOCK regardless of font size — the structural hierarchy is completely lost with no warning.

Fix: skip items whose stripped text is empty inside _classify_heading_sizes, mirroring the annotation loop:

text = _attr(item, "text", "") or ""
if not text.strip():
    continue   # add this
weights[round(fs_f, 1)] += len(text.strip())   # no longer need max(…,1)

Bug — _append_image_tokens exception handler can't roll back partial PAWLs mutations

File: opencontractserver/pipeline/parsers/liteparse_parser.py, _append_image_tokens (~line 1170)

pawls_pages[page_idx]["tokens"].append(unified_token) mutates the shared list in place. If an exception fires after page 0's image tokens are appended but before all pages are processed, the except block resets images_by_page = {} and image_token_offsets = {} — but cannot undo the already-written tokens in pawls_pages[0]. The caller creates zero Image annotations, but the PAWLs file has an orphaned image token with no annotation pointing to it. Additionally, the summary log at the end of _convert_result_to_opencontracts will count is_image tokens as non-zero while reporting zero image annotations — a misleading discrepancy.

Fix: collect all per-page data into a local accumulator and flush to pawls_pages only after the full loop succeeds, or save and restore each page's token list length in the except block.


Cleanup — Structural label strings duplicated across parsers

Files: liteparse_parser.py lines 76–79 and llamaparse_parser.py lines 77–84 + several inline literals

"Title", "Section Header", "Text Block", and "Image" are module-level constants in liteparse_parser.py but scattered as hardcoded values inside ELEMENT_TYPE_MAPPING, default arguments, and inline strings in llamaparse_parser.py. There is no shared source of truth. A one-character discrepancy (e.g. "Section header" vs. "Section Header") would silently produce label mismatches in downstream label-matching logic.

Fix: move these four strings to opencontractserver/constants/annotations.py (or a new pipeline_labels.py) and import from there in both parsers.


Cleanup — DEFAULT_WIDTH/DEFAULT_HEIGHT module-level aliases are pure noise

File: opencontractserver/pipeline/parsers/liteparse_parser.py, lines 84–85

DEFAULT_WIDTH = DEFAULT_PDF_PAGE_WIDTH   # alias adds indirection without value
DEFAULT_HEIGHT = DEFAULT_PDF_PAGE_HEIGHT

The comment at line 83 already explains the provenance. The aliases prevent grep from connecting call sites to the canonical constant. Use DEFAULT_PDF_PAGE_WIDTH/DEFAULT_PDF_PAGE_HEIGHT directly throughout the module.


Cleanup — MIN_CONTENT_FONT_SIZE belongs in the constants file

File: opencontractserver/pipeline/parsers/liteparse_parser.py, line 92

MIN_CONTENT_FONT_SIZE = 1.0 is a domain threshold but is defined at module level in the parser rather than in opencontractserver/constants/document_processing.py, which already houses DEFAULT_PDF_PAGE_WIDTH and DEFAULT_PDF_PAGE_HEIGHT. Per CLAUDE.md ("No magic numbers — use the constants files"), move it there alongside the other PDF processing constants.


Cleanup — Duplicate test helpers should be shared

File: opencontractserver/tests/test_doc_parser_liteparse.py, lines 83–178

create_mock_token_extraction_result and patch_parser_settings are near-identical copies of functions in test_doc_parser_llamaparse.py. Any change to the extract_pawls_tokens_from_pdf 6-tuple signature or the settings loading mechanism requires updating both files independently. Extract these into a shared helpers module (e.g. opencontractserver/tests/pipeline/helpers.py) imported by both test files.


Note — _SELECTED_OFFICE_PARSER → Docling fallback is noisier for air-gapped deployments

File: config/settings/base.py, lines 1325–1338

When PDF_PARSER=liteparse, DOCX/PPTX now route to DoclingParser. In a Docling-free deployment this results in a connection error to DOCLING_PARSER_SERVICE_URL rather than the previous quiet None return (which happened because document.pdf_file is empty for a DOCX). Both paths ultimately fail, so this is not a regression in outcome, but the noisier failure mode may surprise operators who chose LiteParse specifically to avoid Docling. A startup-time warning log when PDF_PARSER=liteparse would help set expectations.

Address the tenth review pass on PR #2075:

- _classify_heading_sizes weighted blank/whitespace-only items at 1 char each
  (the previous max(len,1) floor), but the annotation loop skips those same
  items. The two passes counted different populations, so a document with many
  empty lines at a non-body size could out-weigh real body prose and silently
  zero out heading detection. Skip blank items here too (and drop the floor) so
  both passes see the same population. Adds a regression test (200 blank 14pt
  items must not override 12pt body).

Other items in the pass were declined (repeat DRY/style: shared label constants,
DEFAULT_WIDTH aliases, MIN_CONTENT_FONT_SIZE location, shared test helpers; plus
a speculative air-gap startup warning) or already handled (the malformed-image
guard added earlier makes _append_image_tokens' loop exception-free, so the
partial-mutation rollback concern can't occur).
@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review

This PR adds LiteParseParser, a third local PDF ingestion engine built on LiteParse (PDFium + optional Tesseract). The overall structure is solid: lazy import of liteparse, comprehensive tests, clean _attr helper for dict/dataclass dual-path, good bounds clamping, and a well-thought-out fallback for Office formats. The refactor extracting DEFAULT_PDF_PAGE_WIDTH/HEIGHT and DOCUMENT_IMAGE_STORAGE_PREFIX into shared constants is a clear improvement. A few issues worth addressing before merge:


1. Bug: _classify_heading_sizes min() tiebreak misclassifies body text as headings

File: opencontractserver/pipeline/parsers/liteparse_parser.py, line 878

body_size = min(size for size, w in weights.items() if w == max_weight)

On footnote-heavy documents, footnote text (e.g. 9 pt) and body prose (e.g. 11 pt) can accumulate the same total character mass. In that tie, min() picks the footnote size as body_size, setting the heading threshold to 9 * 1.2 = 10.8. Every 11 pt body paragraph then satisfies 11 > 10.8 and is emitted as a Section Header — headings proliferate across the document.

The fix is max() instead of min(). When two sizes tie, the larger is more likely to be body prose; picking it keeps the heading threshold above actual body text. The docstring comment says "prefer the SMALLEST tied size (body is smaller than headings)" but the reasoning is inverted — body IS smaller than headings, which is exactly why the tiebreak should pick the LARGER tied size as body. The existing test test_classify_heading_sizes_tie_breaks_to_smaller_body doesn't catch this because its tied items are 1-char strings that don't represent the footnote scenario.


2. Convention: Magic numbers in the test file (CLAUDE.md: "no magic numbers")

File: opencontractserver/tests/test_doc_parser_liteparse.py, many lines (e.g. 104, 113, 226, 299, ...)

612 and 792 appear repeatedly in make_page calls and create_mock_token_extraction_result. This same PR adds DEFAULT_PDF_PAGE_WIDTH = 612.0 and DEFAULT_PDF_PAGE_HEIGHT = 792.0 to opencontractserver/constants/document_processing.py precisely to avoid these literals. The test file should import and use them.


3. Cleanup: Redundant module-level aliases in liteparse_parser.py

File: opencontractserver/pipeline/parsers/liteparse_parser.py, lines 84-85

DEFAULT_WIDTH = DEFAULT_PDF_PAGE_WIDTH
DEFAULT_HEIGHT = DEFAULT_PDF_PAGE_HEIGHT

These are local re-bindings of already-imported constants. Every other parser uses the imported names directly. The comment says "so the literal lives in exactly one place" but it already does — in document_processing.py. Use DEFAULT_PDF_PAGE_WIDTH/DEFAULT_PDF_PAGE_HEIGHT at the call sites directly.


4. Minor: STRtree import should be under TYPE_CHECKING

File: opencontractserver/pipeline/parsers/liteparse_parser.py, line 42

from shapely.strtree import STRtree

Unlike llamaparse_parser.py (which constructs STRtree objects directly), this parser delegates all tree construction to extract_pawls_tokens_from_pdf and consumes the pre-built trees via find_tokens_in_bbox. STRtree appears only as a type annotation (spatial_indices: dict[int, STRtree]). Guard it under TYPE_CHECKING to keep the runtime footprint minimal.


5. Minor security note: password in PARSER_KWARGS bypasses DB-encrypted pathway

File: config/settings/base.py, line ~1460

The password field uses SettingType.SECRET in the Settings dataclass (encrypted DB storage). But PARSER_KWARGS is a plain Python dict that bypasses that pathway — when the DB settings row is absent (first run, migration, testing), the password from LITEPARSE_PASSWORD env var travels as a plain string in merged_kwargs. Log redaction via redact_sensitive_kwargs catches it ("password" is in SENSITIVE_KEY_PATTERNS), so this is low-severity. A comment noting that the PARSER_KWARGS entry is the pre-DB-migration fallback path would help future maintainers understand why the SettingType.SECRET annotation is still meaningful.


Summary: The min()->max() tiebreak fix (finding 1) is the most important — it can cause systematic mislabelling on footnote-heavy legal documents, which is the core use case. The rest are cleanup and convention items. The architecture, lazy imports, heading stack, bounds clamping, and image token alignment are all well-designed.

…ilure)

Address the eleventh review pass on PR #2075:

- On an exact character-mass tie between two font sizes, _classify_heading_sizes
  now picks the LARGER tied size as body (max instead of min). This raises the
  heading threshold so the failure mode is under-detection (flat output, same as
  headings-off) rather than over-detection (real body paragraphs mislabelled as
  Section Headers) — the latter being the worse, more visible error on
  footnote-heavy documents. The tiebreak only affects exact ties, which occur on
  degenerate/short inputs; normal documents have a unique body maximum and are
  unaffected. Renamed/updated the tie test accordingly.

(I'd previously defended the min() choice for the short-doc title-ties-body case,
but max() is the better general default: it fails toward benign flat output.)

Other items in the pass declined per the established rationale (test-fixture
612/792 literals, DEFAULT_WIDTH aliases, STRtree TYPE_CHECKING micro-opt with no
real footprint benefit since shapely loads regardless, password PARSER_KWARGS
comment — redaction already covers the logging risk).

JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

On #1 — you're right that the tiebreak should favor the larger size; switched min()max() in 28c8dd3 (reversing the choice I'd defended earlier). The precise reason is the failure mode: on an exact char-mass tie, picking the smaller size lowers the threshold and risks over-detecting headings (real body paragraphs → Section Headers), whereas picking the larger size raises it and fails toward under-detection (flat output, same as headings-off) — the benign error. It only affects exact ties (degenerate/short inputs); normal docs have a unique body maximum. Updated/renamed the tie test.

The other items I'm leaving as-is, with the same rationale given earlier: test-fixture 612/792 are concrete fixture dimensions (not refs to the prod default); the DEFAULT_WIDTH alias is a deliberate local-brevity binding (and LlamaParse uses the same one, so it's not inconsistent); STRtree under TYPE_CHECKING saves nothing since shapely is imported regardless; and the password logging risk is already handled by redact_sensitive_kwargs. The remaining DRY extractions stay tracked for a dedicated follow-up.


Generated by Claude Code

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review: Add LiteParse local PDF ingestion engine (#2074)

This PR adds a well-structured third PDF backend (LiteParseParser) using LiteParse/PDFium for fully-local parsing. The heading-detection heuristic, heading-stack parent-child wiring, and fallback PAWLs synthesis when pdfplumber fails are all implemented correctly. The constant extraction (DEFAULT_PDF_PAGE_WIDTH/HEIGHT, DOCUMENT_IMAGE_STORAGE_PREFIX) and the redact_sensitive_kwargs plumbing in BaseParser are positive housekeeping moves. A few issues worth addressing before merge:


1. numpy imported for a type annotation — breaks module load without numpy

opencontractserver/pipeline/parsers/liteparse_parser.py, line 40

import numpy as np

np is used only in one annotation (dict[int, np.ndarray]) at the point where extract_pawls_tokens_from_pdf's return value is destructured. No numpy operations are called directly. In any environment where numpy is not installed, the module fails to import entirely — blocking the pipeline registry from discovering LiteParseParser at startup — even though numpy is never exercised at runtime.

Fix: replace the annotation with dict[int, Any] and drop the numpy import. llamaparse_parser.py has the same pattern on line 40 and would benefit from the same fix.


2. Silent image data loss when target_pages + pdfplumber token extraction both fail

opencontractserver/pipeline/parsers/liteparse_parser.py, ~line 419–450 and ~line 755

When extract_pawls_tokens_from_pdf raises and target_pages restricts parsing to a non-terminal range (e.g. '5-7' on a 10-page PDF), page_dimensions contains only keys {4, 5, 6} and the fallback pawls_pages is built as range(max_idx + 1) = range(7) — indices 0–6. _append_image_tokens then runs pdfplumber over the full PDF, producing page_idx values 0–9. The guard:

if page_idx >= len(pawls_pages):
    ...  # silently skipped

drops images on pages 7–9 with no log entry. On a normal run (pdfplumber succeeds) all N pages are present and the guard never fires, so this only surfaces in the double-failure case. At minimum a logger.warning(...) should be emitted when an image is silently dropped.


3. _create_annotation near-verbatim duplicate of llamaparse_parser.py:1219–1291

opencontractserver/pipeline/parsers/liteparse_parser.py, ~lines 960–1008

The two methods are already diverging: liteparse wires parent_id through; llamaparse hard-codes it to None. A future change to OpenContractsAnnotationPythonType's shape must be applied in two places. The same is true for the image-token construction block (_append_image_tokens lines 782–801 vs. llamaparse_parser.py:543–566); llamaparse_parser.py even has _build_image_token (line 1186) that was not reused here.

Consider lifting _create_annotation (and _build_image_token) into BaseParser or a shared parser utility module so both parsers can import it. CLAUDE.md rule 2 (DRY) applies.


4. Redundant module-level aliases for shared constants

opencontractserver/pipeline/parsers/liteparse_parser.py, lines 84–85

DEFAULT_WIDTH = DEFAULT_PDF_PAGE_WIDTH
DEFAULT_HEIGHT = DEFAULT_PDF_PAGE_HEIGHT

These are immediately re-bound from the already-imported DEFAULT_PDF_PAGE_WIDTH / DEFAULT_PDF_PAGE_HEIGHT. Every call site in the file uses the short aliases, forcing a reader to trace through two levels of indirection to find the authoritative value. CLAUDE.md rule 2 (DRY) and rule 1 (no dead code) both apply. Use the canonical constant names throughout.


5. _attr() helper belongs in shared utils

opencontractserver/pipeline/parsers/liteparse_parser.py, lines 95–105

def _attr(obj: Any, name: str, default: Any = None) -> Any:
    """Read ``name`` from a LiteParse dataclass *or* a plain dict."""
    if isinstance(obj, dict):
        return obj.get(name, default)
    return getattr(obj, name, default)

This is a useful cross-parser utility for handling SDK types that may return either dataclasses or dicts. Per CLAUDE.md rule 6, utility functions belong in opencontractserver/utils/. Placing it there also lets llamaparse_parser.py and future parsers reuse it rather than duplicating it.


Non-issues confirmed during review:

  • Heading stack append ordering (str(annotation_id_counter) before the increment on line 647) is correct — annotation ids match what's pushed onto the stack.
  • redact_sensitive_kwargs exists at opencontractserver/utils/logging.py and covers the password key.
  • extract_pawls_tokens_from_pdf's discarded 6-tuple elements (page dims and full content) are intentionally dropped.
  • _bounds_from_item clamping is sound for all valid page dimensions (normalization guards against <= 0 before this point is reached).
  • DEFAULT_PDF_PAGE_WIDTH/HEIGHT float vs. prior int literals causes no downstream type mismatch — PawlsPageBoundaryPythonType declares these as float and every call site casts via float().

JSv4 commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

No code change for this pass — evaluated all six, none are actionable:

Thanks for the confirmations on the non-issues (heading-stack ordering, redaction coverage, bounds clamping, float-vs-int). The PR is functionally complete; remaining suggestions are all cross-parser refactors better done in their own change.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Liteparse Ingestion Engine

2 participants