feat: add threaded docling-parse (v6) PDF backend #3377

Open: cau-git wants to merge 13 commits into main from cau/docling-parse-threaded

@cau-git cau-git commented Apr 28, 2026

Introduces ThreadedDoclingParseDocumentBackend and ThreadedDoclingParsePageBackend as a new PDF backend that drives docling-parse's threaded API directly. The matching StandardPdfPipeline runs page parsing, OCR, layout, table, and assembly in concurrent pipeline stages with a dedicated producer thread.

Backend contract additions:

  • PdfPageBackend.page_no abstract property (all existing backends updated: pypdfium2, image, mets-gbs)
  • PdfDocumentBackend.iter_pages() default via load_page(); threaded backend overrides to yield in completion order
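
The two contract additions above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PageBackendSketch`, `DocumentBackendSketch`, `ListPage`, and `ListDocument` are hypothetical stand-ins, and the 0-based page index is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class PageBackendSketch(ABC):
    """Hypothetical stand-in for a page backend (not docling's class)."""

    @property
    @abstractmethod
    def page_no(self) -> int:
        """Index of the page this backend was loaded for (assumed 0-based)."""


class DocumentBackendSketch(ABC):
    """Hypothetical stand-in for a document backend (not docling's class)."""

    @abstractmethod
    def page_count(self) -> int: ...

    @abstractmethod
    def load_page(self, page_no: int) -> PageBackendSketch: ...

    def iter_pages(self) -> Iterator[PageBackendSketch]:
        # Default behaviour: sequential order, one load_page() call per page.
        # A threaded backend would override this to yield in completion order.
        for page_no in range(self.page_count()):
            yield self.load_page(page_no)


class ListPage(PageBackendSketch):
    """Trivial concrete page used only to exercise the sketch."""

    def __init__(self, page_no: int) -> None:
        self._page_no = page_no

    @property
    def page_no(self) -> int:
        return self._page_no


class ListDocument(DocumentBackendSketch):
    """Trivial concrete document with a fixed number of pages."""

    def __init__(self, n_pages: int) -> None:
        self._n_pages = n_pages

    def page_count(self) -> int:
        return self._n_pages

    def load_page(self, page_no: int) -> PageBackendSketch:
        return ListPage(page_no)
```

Because every page carries its own `page_no`, a consumer no longer needs to rely on iteration order to know which page it received.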

ThreadedDoclingParseDocumentBackend specifics:

  • Constructs one DoclingThreadedPdfParser per instance with fixed decode and render config; no pypdfium2 dependency
  • Passes page_numbers=None for the default (all-pages) case; explicit range list only when the caller specifies a finite page_range
  • page_count() delegates to parser.page_count(doc_key)
  • Coordinate conversion (bottom-left → top-left) applied once in get_segmented_page() and cached; all downstream consumers (layout postprocessor, OCR merge, assembly) receive top-left cells
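
To illustrate the convert-once-and-cache point in the last bullet, here is a small sketch. `Box`, `to_top_left`, and `PageSketch` are hypothetical names chosen for illustration; the real backend works on docling's own cell types, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; `top` and `bottom` are y values in the current origin."""

    left: float
    top: float
    right: float
    bottom: float


def to_top_left(box: Box, page_height: float) -> Box:
    # Flip the y axis: a distance from the bottom edge of the page becomes
    # a distance from the top edge. The x values are unchanged.
    return Box(box.left, page_height - box.top, box.right, page_height - box.bottom)


class PageSketch:
    """Hypothetical page that converts its cells once and caches the result."""

    def __init__(self, page_height: float, raw_cells: List[Box]) -> None:
        self.page_height = page_height
        self._raw_cells = raw_cells  # assumed to be in bottom-left origin
        self._converted: Optional[List[Box]] = None
        self.conversions = 0  # counter used only to demonstrate the caching

    def get_segmented_page(self) -> List[Box]:
        if self._converted is None:
            self._converted = [to_top_left(c, self.page_height) for c in self._raw_cells]
            self.conversions += 1
        return self._converted
```

With this shape, every downstream consumer that calls `get_segmented_page()` sees the same top-left cells, and none of them needs to know the parser's native origin.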

Pipeline (StandardPdfPipeline / PreprocessThreadedStage):

  • Producer thread attaches page backends from iter_pages() to ordered page stubs by page_no, then enqueues ThreadedItems
  • Invalid page backends are separated before model calls so they cannot be double-emitted if the preprocessing model raises
  • Timeout and early-termination accounting tracks by page-number sets
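
The producer behaviour described above might look roughly like the sketch below. It is a simplified illustration using only the standard library; `PageBackend`, `PageStub`, and `run_producer` are hypothetical names, and the real pipeline stages, `ThreadedItem` type, and timeout handling are not modelled.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass
class PageBackend:
    """Hypothetical page backend; `valid` mimics a failed or unparsable page."""

    page_no: int
    valid: bool = True


@dataclass
class PageStub:
    """Hypothetical ordered placeholder that a backend is attached to later."""

    page_no: int
    backend: Optional[PageBackend] = None


_DONE = object()  # sentinel marking the end of the producer's output


def run_producer(
    backends: Iterable[PageBackend], page_nos: List[int]
) -> Tuple[List[PageStub], List[PageStub], Set[int]]:
    stubs: Dict[int, PageStub] = {n: PageStub(n) for n in page_nos}
    out: queue.Queue = queue.Queue()

    def produce() -> None:
        # Backends may arrive in completion order, so match them by page_no.
        for backend in backends:
            stub = stubs[backend.page_no]
            stub.backend = backend
            out.put(stub)
        out.put(_DONE)

    producer = threading.Thread(target=produce, daemon=True)
    producer.start()

    valid: List[PageStub] = []
    invalid: List[PageStub] = []
    seen: Set[int] = set()
    while True:
        item = out.get()
        if item is _DONE:
            break
        seen.add(item.page_no)
        # Split invalid pages off before any model call would happen, so a
        # failing model cannot cause them to be emitted a second time.
        (valid if item.backend.valid else invalid).append(item)
    producer.join()

    # Accounting by page-number set: pages requested but never delivered.
    missing = set(page_nos) - seen
    return valid, invalid, missing
```

Tracking `seen` and `missing` as sets of page numbers, rather than counting items, keeps the accounting correct even when pages arrive out of order.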

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

mergify Bot commented Apr 28, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:


github-actions Bot commented Apr 28, 2026

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

cau-git added 3 commits April 28, 2026 22:11
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
cau-git added 2 commits May 8, 2026 16:14
- add dedicated threaded docling-parse backend options and wire CLI
  num_threads into parser_threads
- make the threaded backend honor parser_threads, falling back to
  AcceleratorOptions only when unset
- resolve threaded page ranges explicitly and clip open-ended requests
  against the actual document length
- cache page sizes in StandardPdfPipeline so failed-page recovery does
  not call load_page() on iterator-only threaded backends
- reject threaded docling-parse in VLM pipelines that still require
  ordered/random load_page() access
- extend backend, CLI, and compatibility tests for the new threaded
  backend behavior
- update the editable docling-parse lock entry

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
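
The page-range clipping and the `parser_threads` fallback listed in this commit can be sketched as below. Both helpers are hypothetical: the function names, the 1-based inclusive `(start, end)` range, and the use of `sys.maxsize` as the open-ended upper bound are assumptions made for illustration, not the signatures used in the PR.

```python
import sys
from typing import List, Optional, Tuple


def resolve_page_numbers(page_range: Tuple[int, int], page_count: int) -> List[int]:
    """Turn a 1-based inclusive range into an explicit, clipped page list."""
    start, end = page_range
    first = max(1, start)
    last = min(end, page_count)  # clip open-ended requests to the document
    return list(range(first, last + 1))  # empty when the range is past the end


def resolve_parser_threads(parser_threads: Optional[int], accelerator_threads: int) -> int:
    """Prefer the backend-specific setting; fall back only when it is unset."""
    return parser_threads if parser_threads is not None else accelerator_threads
```

For example, `resolve_page_numbers((3, sys.maxsize), 5)` yields pages 3 through 5 instead of an unbounded list.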
PeterStaar-IBM
PeterStaar-IBM previously approved these changes May 19, 2026

@PeterStaar-IBM PeterStaar-IBM left a comment

lgtm!

@cau-git cau-git changed the title from "feat: add threaded PDF backend consuming DoclingThreadedPdfParser" to "feat: add threaded docling-parse (v6) PDF backend" May 22, 2026
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git marked this pull request as ready for review May 22, 2026 07:55

codecov Bot commented May 22, 2026


dosubot Bot commented May 22, 2026

Related Knowledge

1 document with suggested updates is ready for review.

Docling

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
View Suggested Changes
@@ -4,6 +4,9 @@
     - `from_formats`: Supported input formats include `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md` (including `txt`, `text`, `qmd`, `rmd`), `csv`, `xlsx`, `xml_uspto`, `xml_jats`, `xml_xbrl`, `mets_gbs`, `json_docling`, `audio`, `vtt`, `latex`
     - `to_formats`: Supported output formats include `md`, `json`, `yaml`, `html`, `html_split_page`, `text`, `doctags`, `vtt`
     - `pdf_backend`: Allowed values: `pypdfium2`, `docling_parse`, `dlparse_v1`, `dlparse_v2`, `dlparse_v4` (default: `docling_parse`)
+          - `threaded_docling_parse`: Threaded Docling Parse backend optimized for concurrent page parsing in the standard PDF pipeline
+      - Backend-specific options:
+          - For `threaded_docling_parse`: `parser_threads` (Optional[PositiveInt]): Number of parser threads to use for the threaded docling-parse backend. If unset, the backend falls back to global accelerator thread settings
     - `do_ocr` (default True): Use OCR
     - `force_ocr`: Replace existing text with OCR-generated text
     - `ocr_engine`, `ocr_lang`: OCR engine and language options

cau-git and others added 6 commits May 22, 2026 11:12
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>

2 participants