Skip to content

Commit 9e73670

Browse files
feat: add Docling document converter (#3066)
* feat: add Docling document converter Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * fixing formatting issue Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * pin lxml dependency for python 3.14 compatibility * adjust workflow incl code coverage tests * fix code coverage tracking * adjust docstring style * remove docling.md file * change private to public attributes for default serialization * test backward compatibility of deserialization * backward compatibility for old import paths * lint * dont treat 0 integration tests as error --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai>
1 parent 58b00df commit 9e73670

14 files changed

Lines changed: 932 additions & 0 deletions

File tree

.github/labeler.yml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -59,6 +59,11 @@ integration:deepeval:
5959
- any-glob-to-any-file: "integrations/deepeval/**/*"
6060
- any-glob-to-any-file: ".github/workflows/deepeval.yml"
6161

62+
integration:docling:
63+
- changed-files:
64+
- any-glob-to-any-file: "integrations/docling/**/*"
65+
- any-glob-to-any-file: ".github/workflows/docling.yml"
66+
6267
integration:elasticsearch:
6368
- changed-files:
6469
- any-glob-to-any-file: "integrations/elasticsearch/**/*"

.github/workflows/docling.yml

Lines changed: 134 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,134 @@
1+
# This workflow comes from https://github.com/ofek/hatch-mypyc
2+
# https://github.com/ofek/hatch-mypyc/blob/5a198c0ba8660494d02716cfc9d79ce4adfb1442/.github/workflows/test.yml
3+
name: Test / docling
4+
5+
on:
6+
schedule:
7+
- cron: "0 0 * * *"
8+
pull_request:
9+
paths:
10+
- "integrations/docling/**"
11+
- "!integrations/docling/*.md"
12+
- ".github/workflows/docling.yml"
13+
push:
14+
branches:
15+
- main
16+
paths:
17+
- "integrations/docling/**"
18+
- "!integrations/docling/*.md"
19+
- ".github/workflows/docling.yml"
20+
21+
defaults:
22+
run:
23+
working-directory: integrations/docling
24+
25+
concurrency:
26+
group: docling-${{ github.head_ref || github.sha }}
27+
cancel-in-progress: true
28+
29+
env:
30+
PYTHONUNBUFFERED: "1"
31+
FORCE_COLOR: "1"
32+
TEST_MATRIX_OS: '["ubuntu-latest", "windows-latest", "macos-latest"]'
33+
TEST_MATRIX_PYTHON: '["3.10", "3.14"]'
34+
35+
jobs:
36+
compute-test-matrix:
37+
runs-on: ubuntu-slim
38+
defaults:
39+
run:
40+
working-directory: .
41+
outputs:
42+
os: ${{ steps.set.outputs.os }}
43+
python-version: ${{ steps.set.outputs.python-version }}
44+
steps:
45+
- id: set
46+
run: |
47+
echo 'os=${{ github.event_name == 'push' && '["ubuntu-latest"]' || env.TEST_MATRIX_OS }}' >> "$GITHUB_OUTPUT"
48+
echo 'python-version=${{ github.event_name == 'push' && '["3.10"]' || env.TEST_MATRIX_PYTHON }}' >> "$GITHUB_OUTPUT"
49+
50+
run:
51+
name: Python ${{ matrix.python-version }} on ${{ startsWith(matrix.os, 'macos-') && 'macOS' || startsWith(matrix.os, 'windows-') && 'Windows' || 'Linux' }}
52+
needs: compute-test-matrix
53+
runs-on: ${{ matrix.os }}
54+
permissions:
55+
contents: write
56+
pull-requests: write
57+
strategy:
58+
fail-fast: false
59+
matrix:
60+
os: ${{ fromJSON(needs.compute-test-matrix.outputs.os) }}
61+
python-version: ${{ fromJSON(needs.compute-test-matrix.outputs.python-version) }}
62+
63+
steps:
64+
- name: Support longpaths
65+
if: matrix.os == 'windows-latest'
66+
working-directory: .
67+
run: git config --system core.longpaths true
68+
69+
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
70+
71+
- name: Set up Python ${{ matrix.python-version }}
72+
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
73+
with:
74+
python-version: ${{ matrix.python-version }}
75+
76+
- name: Install Hatch
77+
run: pip install --upgrade hatch
78+
79+
- name: Lint
80+
if: matrix.python-version == '3.10' && runner.os == 'Linux'
81+
run: hatch run fmt-check && hatch run test:types
82+
83+
- name: Run unit tests
84+
run: hatch run test:unit-cov-retry
85+
86+
# On PR: generates coverage comment artifact. On push to main: stores coverage baseline on data branch.
87+
- name: Store unit tests coverage
88+
if: matrix.python-version == '3.10' && runner.os == 'Linux' && github.event_name != 'schedule'
89+
uses: py-cov-action/python-coverage-comment-action@7188638f871f721a365d644f505d1ff3df20d683 # v3.40
90+
with:
91+
GITHUB_TOKEN: ${{ github.token }}
92+
COVERAGE_PATH: integrations/docling
93+
SUBPROJECT_ID: docling
94+
COMMENT_ARTIFACT_NAME: coverage-comment-docling
95+
MINIMUM_GREEN: 90
96+
MINIMUM_ORANGE: 60
97+
98+
- name: Run integration tests
99+
run: hatch run test:integration-cov-append-retry
100+
101+
- name: Store combined coverage
102+
if: github.event_name == 'push'
103+
uses: py-cov-action/python-coverage-comment-action@7188638f871f721a365d644f505d1ff3df20d683 # v3.40
104+
with:
105+
GITHUB_TOKEN: ${{ github.token }}
106+
COVERAGE_PATH: integrations/docling
107+
SUBPROJECT_ID: docling-combined
108+
COMMENT_ARTIFACT_NAME: coverage-comment-docling-combined
109+
MINIMUM_GREEN: 90
110+
MINIMUM_ORANGE: 60
111+
112+
- name: Run unit tests with lowest direct dependencies
113+
if: github.event_name != 'push'
114+
run: |
115+
hatch run uv pip compile pyproject.toml --resolution lowest-direct --output-file requirements_lowest_direct.txt
116+
hatch -e test env run -- uv pip install -r requirements_lowest_direct.txt
117+
hatch run test:unit
118+
119+
- name: Nightly - run tests with Haystack main branch
120+
if: github.event_name == 'schedule'
121+
run: |
122+
hatch env prune
123+
hatch -e test env run -- uv pip install git+https://github.com/deepset-ai/haystack.git@main
124+
hatch run test:unit-cov-retry
125+
126+
127+
notify-slack-on-failure:
128+
needs: run
129+
if: failure() && github.event_name == 'schedule'
130+
runs-on: ubuntu-slim
131+
steps:
132+
- uses: deepset-ai/notify-slack-action@3cda73b77a148f16f703274198e7771340cf862b # v1
133+
with:
134+
slack-webhook-url: ${{ secrets.SLACK_WEBHOOK_URL_NOTIFICATIONS }}

README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,7 @@ Please check out our [Contribution Guidelines](CONTRIBUTING.md) for all the deta
3737
| [cohere-haystack](integrations/cohere/) | Embedder, Generator, Ranker | [![PyPI - Version](https://img.shields.io/pypi/v/cohere-haystack.svg)](https://pypi.org/project/cohere-haystack) | [![Test / cohere](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/cohere.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/cohere.yml) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-cohere/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-cohere/htmlcov/index.html) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-cohere-combined/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-cohere-combined/htmlcov/index.html) |
3838
| [cometapi-haystack](integrations/cometapi/) | Generator | [![PyPI - Version](https://img.shields.io/pypi/v/cometapi-haystack.svg)](https://pypi.org/project/cometapi-haystack) | [![Test / cometapi](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/cometapi.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/cometapi.yml) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-cometapi/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-cometapi/htmlcov/index.html) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-cometapi-combined/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-cometapi-combined/htmlcov/index.html) |
3939
| [deepeval-haystack](integrations/deepeval/) | Evaluator | [![PyPI - Version](https://img.shields.io/pypi/v/deepeval-haystack.svg)](https://pypi.org/project/deepeval-haystack) | [![Test / deepeval](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/deepeval.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/deepeval.yml) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-deepeval/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-deepeval/htmlcov/index.html) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-deepeval-combined/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-deepeval-combined/htmlcov/index.html) |
40+
| [docling-haystack](integrations/docling/) | Converter | [![PyPI - Version](https://img.shields.io/pypi/v/docling-haystack.svg)](https://pypi.org/project/docling-haystack) | [![Test / docling](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/docling.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/docling.yml) |
4041
| [dspy-haystack](integrations/dspy/) | Generator | [![PyPI - Version](https://img.shields.io/pypi/v/dspy-haystack.svg)](https://pypi.org/project/dspy-haystack) | [![Test / dspy](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/dspy.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/dspy.yml) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-dspy/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-dspy/htmlcov/index.html) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-dspy-combined/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-dspy-combined/htmlcov/index.html) |
4142
| [elasticsearch-haystack](integrations/elasticsearch/) | Document Store | [![PyPI - Version](https://img.shields.io/pypi/v/elasticsearch-haystack.svg)](https://pypi.org/project/elasticsearch-haystack) | [![Test / elasticsearch](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/elasticsearch.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/elasticsearch.yml) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-elasticsearch/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-elasticsearch/htmlcov/index.html) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-elasticsearch-combined/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-elasticsearch-combined/htmlcov/index.html) |
4243
| [faiss-haystack](integrations/faiss/) | Document Store | [![PyPI - Version](https://img.shields.io/pypi/v/faiss-haystack.svg)](https://pypi.org/project/faiss-haystack) | [![Test / faiss](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/faiss.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/faiss.yml) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-faiss/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-faiss/htmlcov/index.html) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-faiss-combined/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-faiss-combined/htmlcov/index.html) |
@@ -80,6 +81,7 @@ Please check out our [Contribution Guidelines](CONTRIBUTING.md) for all the deta
8081
| [weave-haystack](integrations/weave/) | Tracer | [![PyPI - Version](https://img.shields.io/pypi/v/weave-haystack.svg)](https://pypi.org/project/weave-haystack) | [![Test / weave](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/weave.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/weave.yml) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-weave/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-weave/htmlcov/index.html) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-weave-combined/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-weave-combined/htmlcov/index.html) |
8182
| [weaviate-haystack](integrations/weaviate/) | Document Store | [![PyPI - Version](https://img.shields.io/pypi/v/weaviate-haystack.svg)](https://pypi.org/project/weaviate-haystack) | [![Test / weaviate](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/weaviate.yml/badge.svg)](https://github.com/deepset-ai/haystack-core-integrations/actions/workflows/weaviate.yml) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-weaviate/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-weaviate/htmlcov/index.html) | [![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/deepset-ai/haystack-core-integrations/python-coverage-comment-action-data-weaviate-combined/endpoint.json&label=)](https://htmlpreview.github.io/?https://github.com/deepset-ai/haystack-core-integrations/blob/python-coverage-comment-action-data-weaviate-combined/htmlcov/index.html) |
8283
| [instructor-embedders-haystack](integrations/instructor_embedders/) | Embedder | [![PyPI - Version](https://img.shields.io/pypi/v/instructor-embedders-haystack.svg)](https://pypi.org/project/instructor-embedders-haystack) | [Staged](https://docs.haystack.deepset.ai/docs/breaking-change-policy#discontinuing-an-integration) | | |
84+
8385
## Releasing
8486

8587
> [!NOTE]

0 commit comments

Comments
 (0)