feat: Add PaddleOCR-VL document converter #2567
anakin87
left a comment
Hey... thanks for the implementation!
I created #2569 to track the work to be done.
I don't think that I will be able to review this PR in detail soon, but in the meantime, please add a CI workflow similar to https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/workflows/anthropic.yml to make sure that tests run in the CI.
Thanks. The CI workflow has been added.
anakin87
left a comment
Thanks for the implementation.
I did a first pass and found some opportunities for improvement...
...src/haystack_integrations/components/converters/paddleocr/paddleocr_vl_document_converter.py
...src/haystack_integrations/components/converters/paddleocr/paddleocr_vl_document_converter.py
...src/haystack_integrations/components/converters/paddleocr/paddleocr_vl_document_converter.py
integrations/paddleocr/tests/test_paddleocr_vl_document_converter.py
integrations/paddleocr/tests/test_paddleocr_vl_document_converter.py
@component
class PaddleOCRVLDocumentConverter:
Could you explain why this is named PaddleOCRVLDocumentConverter instead of a more generic PaddleOCRDocumentConverter? Is this component tied to a specific model, or could it be used with other models?
PaddleOCR offers two document-parsing solutions: PP-StructureV3 and PaddleOCR-VL, and the interfaces are slightly different. The PaddleOCRVLDocumentConverter uses the PaddleOCR-VL algorithm internally, which is why it is named this way. We planned to add a PPStructureV3DocumentConverter in a future PR. Would you recommend using a single all-in-one class that supports the different PaddleOCR algorithms as backends, or separate classes as we originally planned?
I don't know how the two interfaces differ. I would just prefer not to have almost identical components, but I trust your judgment here. Please provide more information if possible.
If we define two separate converters, I expect the following differences between them:
- The optional parameters accepted by their __init__ methods differ significantly. You can see the differences in the supported arguments for the two algorithms here and here.
- Although the run method of each converter outputs raw_paddleocr_responses with a broadly similar structure, the exact fields are not identical, and there are notable differences.
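To make the trade-off concrete, here is a minimal sketch of the "separate classes" option. It is only an illustration under stated assumptions: the class names, the option names (`vl_only_option`, `structure_only_option`), and the response fields are hypothetical placeholders, not the real PaddleOCR arguments or the code in this PR. Only the `raw_paddleocr_responses` output key comes from the discussion above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class VLConverterSketch:
    """Hypothetical stand-in for a PaddleOCR-VL converter."""

    # Placeholder for an argument that only the PaddleOCR-VL backend accepts.
    vl_only_option: Optional[bool] = None

    def run(self, sources: List[str]) -> Dict[str, Any]:
        # Fake response; the real field names are not specified in this thread.
        responses = [{"source": s, "backend": "paddleocr-vl"} for s in sources]
        return {"raw_paddleocr_responses": responses}


@dataclass
class StructureV3ConverterSketch:
    """Hypothetical stand-in for a PP-StructureV3 converter."""

    # Placeholder for an argument that only the PP-StructureV3 backend accepts.
    structure_only_option: Optional[int] = None

    def run(self, sources: List[str]) -> Dict[str, Any]:
        # Same top-level key as above, but the inner fields may differ.
        responses = [{"source": s, "backend": "pp-structurev3"} for s in sources]
        return {"raw_paddleocr_responses": responses}
```

With separate classes, each `__init__` exposes only the options that make sense for its backend. A single all-in-one class would instead have to accept the union of both option sets and validate them against the selected backend.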
Related Issues
Proposed Changes:
This is a new feature that adds the official PaddleOCR integration for Haystack, providing a PaddleOCR-VL document converter component. The component leverages PaddleOCR's API for document parsing and supports text extraction from both PDF and image files.
How did you test it?
This PR includes a unit test suite covering initialization, parameter validation, file type inference, and API calls, for both PDF and image inputs.
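As an illustration of what a file type inference test could look like, here is a small self-contained sketch. The helper `infer_file_type`, the set of image suffixes, and the return values "pdf" and "image" are assumptions made for this example; they are not taken from the converter's actual code.

```python
from pathlib import Path

# Assumed set of image extensions, for illustration only.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def infer_file_type(path: str) -> str:
    """Hypothetical helper: classify a path as 'pdf' or 'image' by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    raise ValueError(f"Cannot infer file type from extension: {suffix!r}")


def test_infer_file_type() -> None:
    # Extension matching is case-insensitive.
    assert infer_file_type("report.PDF") == "pdf"
    assert infer_file_type("scan.jpeg") == "image"
    # Unknown extensions are rejected rather than guessed.
    try:
        infer_file_type("notes.txt")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for unsupported extension")
```

Calling `test_infer_file_type()` directly, or collecting it with pytest, exercises both the PDF and image branches as well as the error path.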
Notes for the reviewer