feat: add threaded docling-parse (v6) PDF backend #3377
Open
cau-git wants to merge 13 commits into
Introduces ThreadedDoclingParseDocumentBackend and ThreadedDoclingParsePageBackend as a new PDF backend that drives docling-parse's threaded API directly. The matching StandardPdfPipeline runs page parsing, OCR, layout, table, and assembly in concurrent pipeline stages with a dedicated producer thread.

Backend contract additions:
- PdfPageBackend.page_no abstract property (all existing backends updated: pypdfium2, image, mets-gbs)
- PdfDocumentBackend.iter_pages() default via load_page(); threaded backend overrides to yield in completion order

ThreadedDoclingParseDocumentBackend specifics:
- Constructs one DoclingThreadedPdfParser per instance with fixed decode and render config; no pypdfium2 dependency
- Passes page_numbers=None for the default (all-pages) case; explicit range list only when the caller specifies a finite page_range
- page_count() delegates to parser.page_count(doc_key)
- Coordinate conversion (bottom-left → top-left) applied once in get_segmented_page() and cached; all downstream consumers (layout postprocessor, OCR merge, assembly) receive top-left cells

Pipeline (StandardPdfPipeline / PreprocessThreadedStage):
- Producer thread attaches page backends from iter_pages() to ordered page stubs by page_no, then enqueues ThreadedItems
- Invalid page backends are separated before model calls so they cannot be double-emitted if the preprocessing model raises
- Timeout and early-termination accounting tracks by page-number sets

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
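The backend contract and the producer's attach-by-`page_no` step described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not docling's actual code: `PageBackend`, `DocumentBackend`, `SimplePage`, `SequentialBackend`, `CompletionOrderBackend`, and `attach_to_stubs` are hypothetical stand-ins, page numbers are assumed to be 0-based, and a `ThreadPoolExecutor` substitutes for docling-parse's threaded parser.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, Iterator, Optional


class PageBackend(ABC):
    """Hypothetical stand-in for PdfPageBackend."""

    @property
    @abstractmethod
    def page_no(self) -> int:
        """Page number this backend was loaded for (0-based here, by assumption)."""


class DocumentBackend(ABC):
    """Hypothetical stand-in for PdfDocumentBackend."""

    @abstractmethod
    def page_count(self) -> int: ...

    @abstractmethod
    def load_page(self, page_no: int) -> PageBackend: ...

    def iter_pages(self) -> Iterator[PageBackend]:
        # Default implementation: sequential, in page order, built on load_page().
        for page_no in range(self.page_count()):
            yield self.load_page(page_no)


class SimplePage(PageBackend):
    def __init__(self, page_no: int) -> None:
        self._page_no = page_no

    @property
    def page_no(self) -> int:
        return self._page_no


class SequentialBackend(DocumentBackend):
    """Uses the inherited, ordered iter_pages()."""

    def __init__(self, n_pages: int) -> None:
        self._n_pages = n_pages

    def page_count(self) -> int:
        return self._n_pages

    def load_page(self, page_no: int) -> PageBackend:
        return SimplePage(page_no)


class CompletionOrderBackend(SequentialBackend):
    """Overrides iter_pages() to yield pages as soon as each one finishes."""

    def iter_pages(self) -> Iterator[PageBackend]:
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(self.load_page, n) for n in range(self.page_count())]
            # as_completed() yields in completion order, so page order is not guaranteed.
            for future in as_completed(futures):
                yield future.result()


def attach_to_stubs(backend: DocumentBackend) -> Dict[int, Optional[PageBackend]]:
    # Ordered stubs keyed by page number; backends fill them in arrival order.
    stubs: Dict[int, Optional[PageBackend]] = {n: None for n in range(backend.page_count())}
    for page in backend.iter_pages():
        stubs[page.page_no] = page
    return stubs
```

Because every page backend carries its own `page_no`, a consumer can restore document order even when the iterator does not, which is presumably why the property becomes part of the contract.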
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
✅ DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
- add dedicated threaded docling-parse backend options and wire CLI num_threads into parser_threads
- make the threaded backend honor parser_threads, falling back to AcceleratorOptions only when unset
- resolve threaded page ranges explicitly and clip open-ended requests against the actual document length
- cache page sizes in StandardPdfPipeline so failed-page recovery does not call load_page() on iterator-only threaded backends
- reject threaded docling-parse in VLM pipelines that still require ordered/random load_page() access
- extend backend, CLI, and compatibility tests for the new threaded backend behavior
- update the editable docling-parse lock entry

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
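The page-range clipping and thread-count fallback in this commit could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: `resolve_page_range` and `resolve_parser_threads` are hypothetical helper names, page numbers are assumed to be 1-based and inclusive, and `accelerator_threads` stands in for whatever value AcceleratorOptions provides.

```python
from typing import List, Optional, Tuple


def resolve_page_range(
    page_range: Optional[Tuple[int, int]], page_count: int
) -> Optional[List[int]]:
    """Turn a requested (start, end) range into explicit page numbers.

    Returns None when no range was requested, meaning "all pages".
    Open-ended requests (an end far past the document) are clipped to page_count.
    """
    if page_range is None:
        return None
    start, end = page_range
    start = max(1, start)        # never start before the first page
    end = min(end, page_count)   # clip against the actual document length
    return list(range(start, end + 1))  # empty if start lies past the last page


def resolve_parser_threads(parser_threads: Optional[int], accelerator_threads: int) -> int:
    """Honor an explicit parser_threads; fall back to the accelerator setting only when unset."""
    return parser_threads if parser_threads is not None else accelerator_threads
```

For example, `resolve_page_range((8, 10**9), 10)` yields `[8, 9, 10]`, and `resolve_parser_threads(None, 4)` yields `4`. Clipping up front keeps the parser from being asked for pages that do not exist.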
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Codecov Report
❌ Patch coverage is
Related Knowledge
1 document with suggested updates is ready for review.
Docling: What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
Suggested changes:
@@ -4,6 +4,9 @@
- `from_formats`: Supported input formats include `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md` (including `txt`, `text`, `qmd`, `rmd`), `csv`, `xlsx`, `xml_uspto`, `xml_jats`, `xml_xbrl`, `mets_gbs`, `json_docling`, `audio`, `vtt`, `latex`
- `to_formats`: Supported output formats include `md`, `json`, `yaml`, `html`, `html_split_page`, `text`, `doctags`, `vtt`
- `pdf_backend`: Allowed values: `pypdfium2`, `docling_parse`, `dlparse_v1`, `dlparse_v2`, `dlparse_v4` (default: `docling_parse`)
+ - `threaded_docling_parse`: Threaded Docling Parse backend optimized for concurrent page parsing in the standard PDF pipeline
+ - Backend-specific options:
+ - For `threaded_docling_parse`: `parser_threads` (Optional[PositiveInt]): Number of parser threads to use for the threaded docling-parse backend. If unset, the backend falls back to global accelerator thread settings
- `do_ocr` (default True): Use OCR
- `force_ocr`: Replace existing text with OCR-generated text
- `ocr_engine`, `ocr_lang`: OCR engine and language options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
…ct/docling into cau/docling-parse-threaded
Signed-off-by: Peter Staar <taa@zurich.ibm.com>