feat: add threaded docling-parse (v6) PDF backend by cau-git · Pull Request #3377 · docling-project/docling

cau-git · 2026-04-28T19:56:23Z

Introduces ThreadedDoclingParseDocumentBackend and ThreadedDoclingParsePageBackend as a new PDF backend that drives docling-parse's threaded API directly. The matching StandardPdfPipeline runs page parsing, OCR, layout, table, and assembly in concurrent pipeline stages with a dedicated producer thread.

Backend contract additions:

PdfPageBackend.page_no abstract property (all existing backends updated: pypdfium2, image, mets-gbs)
PdfDocumentBackend.iter_pages() default via load_page(); threaded backend overrides to yield in completion order
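The contract additions above can be pictured with a minimal sketch. The two base-class names mirror the ones named in this PR, but the bodies are simplified stand-ins written for illustration. The `_Page` and `_ListBackend` helpers and the zero-based `load_page()` index are assumptions, not the actual docling implementation.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class PdfPageBackend(ABC):
    """Simplified stand-in for the page backend contract."""

    @property
    @abstractmethod
    def page_no(self) -> int:
        """Page number this backend belongs to."""


class PdfDocumentBackend(ABC):
    """Simplified stand-in for the document backend contract."""

    @abstractmethod
    def page_count(self) -> int: ...

    @abstractmethod
    def load_page(self, page_no: int) -> PdfPageBackend: ...

    def iter_pages(self) -> Iterator[PdfPageBackend]:
        # Default: sequential order via load_page(). A threaded backend
        # may override this to yield pages in completion order instead.
        for page_no in range(self.page_count()):
            yield self.load_page(page_no)


# Hypothetical concrete classes, for demonstration only.
class _Page(PdfPageBackend):
    def __init__(self, page_no: int) -> None:
        self._page_no = page_no

    @property
    def page_no(self) -> int:
        return self._page_no


class _ListBackend(PdfDocumentBackend):
    def __init__(self, n_pages: int) -> None:
        self._n = n_pages

    def page_count(self) -> int:
        return self._n

    def load_page(self, page_no: int) -> PdfPageBackend:
        return _Page(page_no)


page_numbers: List[int] = [p.page_no for p in _ListBackend(3).iter_pages()]
```

Because every page backend reports its own `page_no`, a consumer of `iter_pages()` does not need to rely on iteration order to know which page it received.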

ThreadedDoclingParseDocumentBackend specifics:

Constructs one DoclingThreadedPdfParser per instance with fixed decode and render config; no pypdfium2 dependency
Passes page_numbers=None for the default (all-pages) case; explicit range list only when the caller specifies a finite page_range
page_count() delegates to parser.page_count(doc_key)
Coordinate conversion (bottom-left → top-left) applied once in get_segmented_page() and cached; all downstream consumers (layout postprocessor, OCR merge, assembly) receive top-left cells
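As a rough illustration of the convert-once-and-cache behaviour described for get_segmented_page(), the sketch below flips the y axis of bottom-left-origin boxes against the page height and stores the result. `Rect`, `to_top_left`, and `PageSketch` are hypothetical names, and the real backend works on docling-parse page objects rather than bare rectangles.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rect:
    """Box edges; the meaning of t/b depends on the origin convention."""
    l: float
    t: float
    r: float
    b: float


def to_top_left(cell: Rect, page_height: float) -> Rect:
    # With a bottom-left origin, y grows upward. Flipping against the page
    # height yields coordinates where y grows downward from the top edge.
    return Rect(l=cell.l, t=page_height - cell.t, r=cell.r, b=page_height - cell.b)


class PageSketch:
    """Hypothetical page wrapper that converts once and caches the result."""

    def __init__(self, page_height: float, cells_bottom_left: List[Rect]) -> None:
        self._page_height = page_height
        self._raw = cells_bottom_left
        self._cached: Optional[List[Rect]] = None
        self.conversions = 0  # counts how often the conversion actually ran

    def get_segmented_page(self) -> List[Rect]:
        if self._cached is None:
            self.conversions += 1
            self._cached = [to_top_left(c, self._page_height) for c in self._raw]
        return self._cached


page = PageSketch(792.0, [Rect(l=10, t=700, r=110, b=680)])
first = page.get_segmented_page()
second = page.get_segmented_page()
```

Caching matters here because applying the flip twice would silently return the cells to bottom-left coordinates.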

Pipeline (StandardPdfPipeline / PreprocessThreadedStage):

Producer thread attaches page backends from iter_pages() to ordered page stubs by page_no, then enqueues ThreadedItems
Invalid page backends are separated before model calls so they cannot be double-emitted if the preprocessing model raises
Timeout and early-termination accounting tracks by page-number sets
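The producer flow above could look roughly like the following sketch. It assumes a plain `queue.Queue` between the producer thread and a single consumer. `PageStub`, `PageBackendStub`, and `run` are hypothetical, and the real pipeline wraps pages in ThreadedItems and runs several stages rather than one consumer loop.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class PageBackendStub:
    page_no: int
    valid: bool = True


@dataclass
class PageStub:
    page_no: int
    backend: Optional[PageBackendStub] = None


_DONE = object()  # sentinel marking the end of the stream


def produce(stubs: List[PageStub], backends: Iterable[PageBackendStub],
            out: queue.Queue) -> None:
    # Backends may arrive in completion order, so match them to the
    # ordered stubs by page_no instead of by position.
    by_no = {s.page_no: s for s in stubs}
    for backend in backends:
        stub = by_no[backend.page_no]
        stub.backend = backend
        out.put(stub)
    out.put(_DONE)


def run(n_pages: int, completion_order: List[int],
        invalid_pages: Set[int]) -> Tuple[List[int], List[int]]:
    stubs = [PageStub(i) for i in range(n_pages)]
    backends = (PageBackendStub(i, valid=i not in invalid_pages)
                for i in completion_order)
    q: queue.Queue = queue.Queue()
    producer = threading.Thread(target=produce, args=(stubs, backends, q), daemon=True)
    producer.start()

    valid: List[int] = []
    invalid: List[int] = []
    while (item := q.get()) is not _DONE:
        # Split invalid pages off before any model call, so a later
        # exception cannot cause them to be emitted a second time.
        (valid if item.backend.valid else invalid).append(item.page_no)
    producer.join()
    return valid, invalid


valid_pages, failed_pages = run(4, completion_order=[2, 0, 3, 1], invalid_pages={3})
```

Tracking results by page number rather than by count is what lets timeout and early-termination accounting stay correct when pages finish out of order.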

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

**Benchmarks**

uv run python ./perfs/iterate_pdf_pages.py -r docling-project/performance-dataset-cornercases --parser-threads 8 --release-native-memory-every-n-pages 0

uv run python ./perfs/iterate_pdf_pages.py -r docling-project/performance-dataset-bo767 --parser-threads 8 --release-native-memory-every-n-pages 0


mergify · 2026-04-28T19:56:59Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
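For readers who want to check a title locally, the rule can be approximated with Python's `re` module. This is only a sketch: it assumes the Mergify `~=` operator behaves like a regular-expression search, and `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


ok = is_conventional("feat: add threaded docling-parse (v6) PDF backend")
scoped = is_conventional("fix(backend)!: handle empty page range")
bad = is_conventional("Add threaded docling-parse backend")
```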

github-actions · 2026-04-28T20:03:28Z

✅ DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉


- add dedicated threaded docling-parse backend options and wire CLI num_threads into parser_threads
- make the threaded backend honor parser_threads, falling back to AcceleratorOptions only when unset
- resolve threaded page ranges explicitly and clip open-ended requests against the actual document length
- cache page sizes in StandardPdfPipeline so failed-page recovery does not call load_page() on iterator-only threaded backends
- reject threaded docling-parse in VLM pipelines that still require ordered/random load_page() access
- extend backend, CLI, and compatibility tests for the new threaded backend behavior
- update the editable docling-parse lock entry

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
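The page-range clipping mentioned in this commit might be sketched as follows. The sketch assumes that a page range is a 1-based inclusive `(start, end)` tuple and that an open-ended request is expressed with a very large end value such as `sys.maxsize`. `resolve_page_numbers` is a hypothetical helper, not a function from the PR.

```python
import sys
from typing import List, Tuple


def resolve_page_numbers(page_range: Tuple[int, int], page_count: int) -> List[int]:
    """Turn a possibly open-ended range into an explicit list of page numbers."""
    start, end = page_range
    first = max(start, 1)        # never start before the first page
    last = min(end, page_count)  # clip open-ended requests to the document length
    return list(range(first, last + 1))


all_pages = resolve_page_numbers((1, sys.maxsize), page_count=5)
tail = resolve_page_numbers((3, 100), page_count=5)
beyond = resolve_page_numbers((7, 9), page_count=5)
```

Resolving the range up front gives the pipeline a finite set of expected page numbers, which is useful when pages are delivered through an iterator rather than fetched by index.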

PeterStaar-IBM

lgtm!


codecov · 2026-05-22T07:58:54Z

Codecov Report

❌ Patch coverage is 77.20588% with 62 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/backend/docling_parse_backend.py	78.40%	27 Missing ⚠️
docling/pipeline/standard_pdf_pipeline.py	79.34%	19 Missing ⚠️
...erimental/pipeline/threaded_layout_vlm_pipeline.py	0.00%	6 Missing ⚠️
docling/pipeline/extraction_vlm_pipeline.py	28.57%	5 Missing ⚠️
docling/pipeline/vlm_pipeline.py	40.00%	3 Missing ⚠️
docling/backend/mets_gbs_backend.py	83.33%	1 Missing ⚠️
docling/backend/pdf_backend.py	90.00%	1 Missing ⚠️


dosubot · 2026-05-22T08:01:16Z

Documentation Updates

1 document(s) were updated by changes in this PR:

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?

@@ -3,7 +3,8 @@
 - **Key Options**:
     - `from_formats`: Supported input formats include `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md` (including `txt`, `text`, `qmd`, `rmd`), `csv`, `xlsx`, `xml_uspto`, `xml_jats`, `xml_xbrl`, `mets_gbs`, `json_docling`, `audio`, `vtt`, `latex`
     - `to_formats`: Supported output formats include `md`, `json`, `yaml`, `html`, `html_split_page`, `text`, `doctags`, `vtt`
-    - `pdf_backend`: Allowed values: `pypdfium2`, `docling_parse`, `dlparse_v1`, `dlparse_v2`, `dlparse_v4` (default: `docling_parse`)
+    - `pdf_backend`: Allowed values: `pypdfium2`, `docling_parse`, `threaded_docling_parse`, `dlparse_v1`, `dlparse_v2`, `dlparse_v4` (default: `docling_parse`)
+        - `threaded_docling_parse`: Threaded Docling Parse backend optimized for concurrent page parsing in the standard PDF pipeline
     - `do_ocr` (default True): Use OCR
     - `force_ocr`: Replace existing text with OCR-generated text
     - `ocr_engine`, `ocr_lang`: OCR engine and language options
@@ -22,6 +23,21 @@
     - `force_backend_text`: Force backend text extraction
     - `layout_custom_config`, `table_structure_custom_config`: Custom model configs for layout/table structure (see Table Structure Models section below)
     - Additional options for picture description and more
+
+---
+
+### PDF Backend Options
+
+#### ThreadedDoclingParseBackendOptions
+
+Options specific to the threaded docling-parse backend:
+
+- **Configuration Class**: `ThreadedDoclingParseBackendOptions`
+- **Kind**: `"threaded-docling-parse"`
+- **Key Options**:
+    - `parser_threads` (Optional[PositiveInt], default: None): Number of parser threads to use for the threaded docling-parse backend. If unset, the backend falls back to global accelerator thread settings.
+    - `release_native_memory_every_n_pages` (integer >= 0, default: 128): Release native parser memory after every N decoded pages in the threaded docling-parse backend. Set to 0 to disable native-memory release.
+    - `password` (Optional[SecretStr]): Password for encrypted PDFs (inherited from `PdfBackendOptions`)
 
 ---
 
@@ -418,6 +434,10 @@
 docling --progress FILE
 ```
 
+**CLI Flags**: The CLI also supports the following flags for the threaded docling-parse backend:
+
+- `--release-native-memory-every-n-pages` (default: 128): Release native parser memory after every N decoded pages when using the threaded docling-parse backend.
+
 ---
 
 #### Additional Notes
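The semantics documented for `parser_threads` and `release_native_memory_every_n_pages` can be mimicked with a small stand-in. The sketch below does not import docling: `ThreadedOptionsSketch`, `effective_parser_threads`, and `should_release_memory` are hypothetical names that only model the documented defaults and fallback rule. In particular, the exact point at which the real backend releases memory is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThreadedOptionsSketch:
    """Stand-in modelling the two documented option fields."""
    parser_threads: Optional[int] = None            # None means "unset"
    release_native_memory_every_n_pages: int = 128  # 0 disables release

    def __post_init__(self) -> None:
        if self.parser_threads is not None and self.parser_threads < 1:
            raise ValueError("parser_threads must be a positive integer")
        if self.release_native_memory_every_n_pages < 0:
            raise ValueError("release_native_memory_every_n_pages must be >= 0")


def effective_parser_threads(options: ThreadedOptionsSketch,
                             accelerator_threads: int) -> int:
    # An explicit parser_threads wins; otherwise fall back to the
    # global accelerator thread setting.
    if options.parser_threads is not None:
        return options.parser_threads
    return accelerator_threads


def should_release_memory(options: ThreadedOptionsSketch, pages_decoded: int) -> bool:
    n = options.release_native_memory_every_n_pages
    return n > 0 and pages_decoded > 0 and pages_decoded % n == 0


defaults = ThreadedOptionsSketch()
tuned = ThreadedOptionsSketch(parser_threads=8, release_native_memory_every_n_pages=0)
```

The benchmark commands in the PR description correspond to the `tuned` case: eight parser threads with native-memory release disabled.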


PeterStaar-IBM requested review from PeterStaar-IBM and dolfim-ibm April 28, 2026 20:05

cau-git added 3 commits April 28, 2026 22:11

Adjust tests and thread count source

cb2fe9d

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Add threaded parse to CLI, make RGB images

e51fc23

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Merge branch 'main' of github.com:DS4SD/docling into cau/docling-parse-threaded

847c725

geoHeil mentioned this pull request Apr 29, 2026

ci: prepare editable docling-parse for docs #3385

Closed

cau-git added 2 commits May 8, 2026 16:14

Merge branch 'main' of github.com:DS4SD/docling into cau/docling-parse-threaded

d3ac439

PeterStaar-IBM previously approved these changes May 19, 2026

cau-git changed the title ~~feat: add threaded PDF backend consuming DoclingThreadedPdfParser~~ feat: add threaded docling-parse (v6) PDF backend May 22, 2026

Update to docling-parse>=6

c7f0e49

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

cau-git dismissed PeterStaar-IBM’s stale review via c7f0e49 May 22, 2026 07:55

cau-git marked this pull request as ready for review May 22, 2026 07:55

cau-git and others added 7 commits May 22, 2026 11:12

allow more lines

42ea365

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Address concerns with non-threaded PDF backend behaviour changes

1a65caf

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

fix nonsense test

cc2a78c

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

adding the feature

4af19d9

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

Merge branch 'cau/docling-parse-threaded' of github.com:docling-project/docling into cau/docling-parse-threaded

e785c7f

adding perfomance measuring scripts

988f37d

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

adding evaluation performance scripts

7bfd2ad

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

PeterStaar-IBM mentioned this pull request May 26, 2026

memory problem in v 5.3.3 docling-project/docling-parse#227

Closed

PeterStaar-IBM and others added 3 commits May 26, 2026 19:54

upgrading to docling-parse of 6.1.0

7f22f46

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

upgraded to docling-parse >=6.1

487d541

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

backend: disable bitmap byte materialization in docling-parse backends

8a76c44

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

cau-git force-pushed the cau/docling-parse-threaded branch from 9c8dc5c to 8a76c44 Compare May 27, 2026 10:47

cau-git and others added 5 commits May 27, 2026 15:19

Updates to iterate_pdf_pages script, lock update

de70143

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Correct decode config

8b7218b

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

pinned to docling-parse v6.2.0

ed23d89

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

merged with main

139906a

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

clean up pyproject.toml

77ee26a

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

PeterStaar-IBM approved these changes May 28, 2026

dolfim-ibm approved these changes May 28, 2026

PeterStaar-IBM merged commit 3c26f5a into main May 28, 2026
46 checks passed

PeterStaar-IBM deleted the cau/docling-parse-threaded branch May 28, 2026 10:38

selloriwoo mentioned this pull request Jun 9, 2026

feat: switch to threaded PDF pipeline (docling 2.99) brekkylab/agent-k#163

Merged

2 tasks

PeterStaar-IBM merged 22 commits into main from cau/docling-parse-threaded


3 participants
