feat: Add PaddleOCR-VL document converter #2567
anakin87
left a comment
Hey... thanks for the implementation!
I created #2569 to track the work to be done.
I don't think that I will be able to review this PR in detail soon, but in the meantime, please add a CI workflow similar to https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/workflows/anthropic.yml to make sure that tests run in the CI.
Thanks. The CI workflow has been added.
anakin87
left a comment
Thanks for the implementation.
I did a first pass and found some opportunities for improvement...
anakin87
left a comment
I left a few minor comments.
Please also update the labeler configuration file by adding an entry for PaddleOCR.
All comments have been resolved. Please take a look.
Related Issues
Proposed Changes:
This is a new feature that adds the official PaddleOCR integration for Haystack, providing a PaddleOCR-VL document converter component. The component leverages PaddleOCR's API for document parsing and supports text extraction from both PDF and image files.
How did you test it?
This PR includes a unit test suite covering initialization, parameter validation, file type inference, and API calls, among other cases. The tests exercise both PDF and image file processing.
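As an illustration of the kind of file type inference the tests describe, here is a minimal sketch. It is not the code from this PR: the function name `infer_file_type`, the return values, and the list of image suffixes are all hypothetical, and the real component may infer types differently.

```python
from pathlib import Path
from typing import Optional

# Hypothetical set of image suffixes; the component in this PR may accept others.
_IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def infer_file_type(path: str) -> Optional[str]:
    """Return "pdf", "image", or None when the suffix is not recognized.

    Hypothetical helper for illustration only; it looks at the file
    suffix and never opens the file.
    """
    suffix = Path(path).suffix.lower()  # case-insensitive comparison
    if suffix == ".pdf":
        return "pdf"
    if suffix in _IMAGE_SUFFIXES:
        return "image"
    return None


def test_infer_file_type() -> None:
    # pytest-style unit test covering PDF, image, and unknown inputs.
    assert infer_file_type("report.PDF") == "pdf"
    assert infer_file_type("scans/page_01.jpeg") == "image"
    assert infer_file_type("notes.txt") is None
    assert infer_file_type("no_extension") is None
```

Returning `None` for unknown suffixes leaves the caller free to either raise an error or fall back to an explicitly passed file type; which of these the actual component does is not stated here.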
Notes for the reviewer
Checklist
Conventional commit types for the PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.