feat: Add PaddleOCR-VL document converter #2567

Merged
anakin87 merged 9 commits into deepset-ai:main from Bobholamovic:feat/paddleocr on Dec 10, 2025
Conversation

@Bobholamovic
Contributor

Related Issues

Proposed Changes:

This is a new feature that adds the official PaddleOCR integration for Haystack, providing a PaddleOCR-VL document converter component. The component leverages PaddleOCR's API for document parsing and supports text extraction from both PDF and image files.

How did you test it?

This PR includes a unit test suite covering initialization, parameter validation, file type inference, and API calls, among others. The tests cover both PDF and image file processing.
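
For illustration only, here is a minimal sketch of the kind of file type inference check mentioned above. The helper `infer_file_type` and both extension sets are hypothetical and are not taken from this PR; the actual component may recognise a different set of formats or use a different mechanism.

```python
from pathlib import Path

# Hypothetical extension sets, assumed for this sketch only.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def infer_file_type(path):
    """Classify a path as "pdf" or "image" from its extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    # Anything else is rejected rather than guessed.
    raise ValueError(f"Cannot infer file type for: {path}")
```

A test in this style would assert the returned label for a few representative paths and check that an unsupported extension raises `ValueError`.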

Notes for the reviewer

Checklist

@Bobholamovic Bobholamovic requested a review from a team as a code owner November 26, 2025 12:33
@Bobholamovic Bobholamovic requested review from mpangrazzi and removed request for a team November 26, 2025 12:33
@CLAassistant

CLAassistant commented Nov 26, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions Bot added the type:documentation Improvements or additions to documentation label Nov 26, 2025
@anakin87 anakin87 self-requested a review November 26, 2025 13:43
@anakin87 anakin87 mentioned this pull request Nov 26, 2025
8 tasks
Member

@anakin87 anakin87 left a comment

Hey... thanks for the implementation!

I created #2569 to track the work to be done.

I don't think that I will be able to review this PR in detail soon, but in the meantime, please add a CI workflow similar to https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/workflows/anthropic.yml to make sure that tests run in the CI.

@Bobholamovic
Contributor Author

Thanks. The CI workflow has been added.

Member

@anakin87 anakin87 left a comment

Thanks for the implementation.

I did a first pass and found some opportunities for improvement...

Comment thread integrations/paddleocr/pydoc/config.yml Outdated
Comment thread .github/workflows/paddleocr.yml
Comment thread integrations/paddleocr/pyproject.toml Outdated
Comment thread integrations/paddleocr/pyproject.toml Outdated
Comment thread integrations/paddleocr/tests/test_paddleocr_vl_document_converter.py Outdated
Comment thread integrations/paddleocr/tests/test_paddleocr_vl_document_converter.py Outdated
@sjrl sjrl removed the request for review from mpangrazzi December 10, 2025 07:52
@sjrl sjrl added the new integration Discuss the creation of a new integration in Core label Dec 10, 2025
Member

@anakin87 anakin87 left a comment

I left a few minor comments.

Please also update the labeler configuration file, adding an entry for Paddle OCR.

Comment thread integrations/paddleocr/README.md Outdated
Comment thread integrations/paddleocr/pyproject.toml Outdated
@Bobholamovic
Contributor Author

All comments have been resolved. Please take a look.

Member

@anakin87 anakin87 left a comment

Looks good. I'll merge this PR and release the first version of this package (0.1.0).

Let's move to #2569 to discuss remaining tasks.

@anakin87 anakin87 merged commit 1f816ac into deepset-ai:main Dec 10, 2025
11 checks passed
@Bobholamovic Bobholamovic deleted the feat/paddleocr branch December 10, 2025 11:08
Labels

new integration: Discuss the creation of a new integration in Core
topic:CI
type:documentation: Improvements or additions to documentation

4 participants