feat: Add docling-serve integration #3139

Draft
cbrumm wants to merge 7 commits into deepset-ai:main from cbrumm:feature/docling-serve-integration

cbrumm commented Apr 13, 2026

Summary

Draft created at PyCon DE Sprints — feedback and review welcome!

Relates to #2960

  • Adds a new DoclingServeConverter component that converts documents via a running docling-serve REST API instance
  • Separate from the existing docling-haystack integration to avoid heavy local dependencies (PyTorch, etc.)
  • Supports file paths, ByteStream objects, and URLs as sources
  • Implements both run() (sync via httpx.Client) and run_async() (via httpx.AsyncClient)
  • Integration tests run against a real docling-serve-cpu Docker container on Linux CI runners

Note that the docling-serve Docker images are quite large (around 4 GB, I think). See docling-serve on PyPI.

Design decisions

  • Keyword-only init with base_url, api_key (Secret), timeout, convert_options (single dict for all docling-serve parameters)
  • Dual endpoint support: URL strings route to /v1/convert/source (v1 sources format with kind discriminator), local files and ByteStreams to /v1/convert/file
  • httpx for both sync and async HTTP in a single dependency
  • Graceful error handling: HTTP errors are logged as warnings; processing continues for remaining sources
  • Integration tests: CI starts ghcr.io/docling-project/docling-serve-cpu via docker run on Linux runners, with health-check polling. macOS/Windows skip integration tests.
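
The endpoint routing and the warn-and-continue error handling described above could be sketched roughly as follows. This is not the PR's code: `pick_endpoint`, `convert_all`, and the injected `send` callable are hypothetical names, plain `bytes` stands in for Haystack's `ByteStream`, and catching a broad `Exception` is an assumption (the component presumably catches narrower HTTP errors).

```python
import logging
from pathlib import Path
from typing import Any, Callable, Union

logger = logging.getLogger(__name__)

SOURCE_ENDPOINT = "/v1/convert/source"  # remote URLs
FILE_ENDPOINT = "/v1/convert/file"      # local files and in-memory bytes

# Assumption: bytes stands in for a ByteStream object in this sketch.
Source = Union[str, Path, bytes]


def pick_endpoint(source: Source) -> str:
    """Route http(s) URL strings to the source endpoint, everything else to the file endpoint."""
    if isinstance(source, str) and source.startswith(("http://", "https://")):
        return SOURCE_ENDPOINT
    return FILE_ENDPOINT


def convert_all(sources: list[Source], send: Callable[[str, Source], Any]) -> list[Any]:
    """Convert each source; log a warning and move on when one of them fails."""
    results = []
    for source in sources:
        endpoint = pick_endpoint(source)
        try:
            results.append(send(endpoint, source))
        except Exception as exc:  # assumption: the real component may catch only HTTP errors
            logger.warning("Skipping %r: request to %s failed: %s", source, endpoint, exc)
    return results
```

Passing `send` in as a parameter keeps the sketch free of network calls; in the component, that role would be played by the httpx client request.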

Test plan

  • hatch run fmt — all checks passed
  • hatch run test:types — no issues found
  • hatch run test:unit — 30/30 passed
  • Integration test: file conversion via /v1/convert/file
  • Integration test: URL conversion via /v1/convert/source (validates v1 sources format)
  • Review metadata attached to returned Documents
  • Review convert_options passthrough behavior

🤖 Generated with Claude Code

CLAassistant commented Apr 13, 2026

CLA assistant check
All committers have signed the CLA.

github-actions bot added the topic:CI and type:documentation labels Apr 13, 2026
Add a new DoclingServeConverter component that converts documents via a
running docling-serve REST API instance, avoiding docling's heavy local
dependencies (PyTorch, etc.).

Relates to deepset-ai#2960

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cbrumm force-pushed the feature/docling-serve-integration branch from ccb9443 to ec97ce5 April 13, 2026 09:43
cbrumm and others added 6 commits April 13, 2026 11:48
The API reference build expects docling_serve.md (matching the
integration folder name), not docling-serve.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pytest exits with code 5 when no tests are collected. The integration
test step selects only @pytest.mark.integration tests, which were
missing. Add a skipped integration test that runs when DOCLING_SERVE_URL
is set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
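
A skipped-but-collected integration test along the lines this commit describes might look like the sketch below. The test name and its placeholder body are hypothetical; only the `integration` marker and the `DOCLING_SERVE_URL` variable come from the commit message. Because a skipped test is still collected, a run that selects only integration tests no longer finds nothing to collect.

```python
import os

import pytest

# Assumption: CI exports DOCLING_SERVE_URL only on runners where a server is running.
DOCLING_SERVE_URL = os.environ.get("DOCLING_SERVE_URL")


@pytest.mark.integration
@pytest.mark.skipif(not DOCLING_SERVE_URL, reason="DOCLING_SERVE_URL is not set")
def test_server_url_is_configured():
    # Placeholder body: a real test would call the converter against the server.
    assert DOCLING_SERVE_URL.startswith(("http://", "https://"))
```

The `integration` marker would normally also be registered in the project's pytest configuration so that it does not trigger unknown-marker warnings.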
… conversion

The /v1/convert/source endpoint requires the new discriminated union format
with "sources": [{"kind": "http", "url": ...}] instead of the deprecated
"http_sources" field. Updated both sync and async code paths and tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
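
For illustration, the request body in the new format could be built as in this minimal sketch. `build_source_payload` is a hypothetical helper, not a function from the PR, and any further fields the endpoint accepts are left out.

```python
import json
from typing import Any


def build_source_payload(urls: list[str]) -> dict[str, Any]:
    """Build a /v1/convert/source body using the discriminated `sources` list."""
    # Each entry carries a "kind" discriminator; "http" marks a remote URL.
    return {"sources": [{"kind": "http", "url": url} for url in urls]}


if __name__ == "__main__":
    # Hypothetical example URL, printed as the JSON that would be sent.
    print(json.dumps(build_source_payload(["https://example.com/report.pdf"]), indent=2))
```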
Start docling-serve-cpu container on Linux runners and run integration
tests against it. Adds a URL conversion test to verify the v1 sources
format fix end-to-end. macOS/Windows runners skip integration tests
since Docker services are not available.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
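
The health-check polling that gates the integration tests could be sketched in Python as below. The PR itself may well do this in the CI workflow's shell steps instead; `http_ok`, `wait_until_healthy`, the retry counts, and the URL in the usage note are all assumptions.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError subclass OSError, so refused connections,
        # timeouts, and non-2xx responses all end up here.
        return False


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Call `probe` until it succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the container time to finish starting
    return False
```

A caller would pass something like `lambda: http_ok("http://localhost:5001/health")`, where both the port and the `/health` path are assumptions about the container setup rather than details taken from this PR.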
…ptions

Create a single httpx.Client/AsyncClient per run() call instead of one
per source, avoiding unnecessary connection overhead when converting
multiple documents. Also copy the convert_options dict on init to prevent
external mutation from affecting component behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
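
Both changes in this commit can be illustrated with a small self-contained sketch. `FakeClient`, `open_client`, and `ConverterSketch` are hypothetical stand-ins (the component uses httpx clients), and `example_option` is a made-up option key.

```python
from contextlib import contextmanager
from typing import Any, Iterator, Optional


class FakeClient:
    """Stand-in for an HTTP client; `opened` counts how many were created."""

    opened = 0

    def post(self, source: str, options: dict[str, Any]) -> dict[str, Any]:
        return {"source": source, "options": dict(options)}


@contextmanager
def open_client() -> Iterator[FakeClient]:
    FakeClient.opened += 1
    yield FakeClient()


class ConverterSketch:
    def __init__(self, *, convert_options: Optional[dict[str, Any]] = None) -> None:
        # Shallow copy: later changes to the caller's dict do not reach the component.
        self.convert_options = dict(convert_options or {})

    def run(self, sources: list[str]) -> list[dict[str, Any]]:
        # One client for the whole call, shared by every source.
        with open_client() as client:
            return [client.post(source, self.convert_options) for source in sources]
```

Note that `dict(...)` copies only the top level; nested lists or dicts inside the options would still be shared with the caller, and `copy.deepcopy` would be the stricter choice if that matters.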