feat: Add PaddleOCR-VL document converter #2567

Bobholamovic · 2025-11-26T12:33:10Z

Related Issues

Proposed Changes:

This PR adds the official PaddleOCR integration for Haystack, providing a PaddleOCR-VL document converter component. The component uses PaddleOCR's API for document parsing and supports text extraction from both PDF and image files.

How did you test it?

This PR includes a unit test suite covering initialization, parameter validation, file type inference, and API calls. The tests cover both PDF and image file processing.
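To illustrate the kind of file type inference the tests refer to, here is a minimal, self-contained sketch. The helper name `infer_file_type` and the extension sets are hypothetical and are not taken from this PR; the actual converter may infer file types differently.

```python
from pathlib import Path
from typing import Literal, Optional

# Hypothetical extension sets for illustration only; the PR's lists may differ.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".webp"}

FileType = Literal["pdf", "image"]


def infer_file_type(path: str) -> Optional[FileType]:
    """Guess whether a path points to a PDF or an image from its extension.

    Returns None when the extension is missing or not recognized, so the
    caller can decide whether to skip the file or raise an error.
    """
    suffix = Path(path).suffix.lower()  # case-insensitive comparison
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None
```

A unit test for such a helper would typically check an upper-case extension, a known image extension, and an unsupported or missing extension.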

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

CLAassistant · 2025-11-26T12:33:18Z

All committers have signed the CLA.

anakin87

Hey... thanks for the implementation!

I created #2569 to track the work to be done.

I don't think that I will be able to review this PR in detail soon, but in the meantime, please add a CI workflow similar to https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/workflows/anthropic.yml to make sure that tests run in the CI.

Bobholamovic · 2025-11-27T03:38:14Z

Thanks. The CI workflow has been added.

anakin87

Thanks for the implementation.

I did a first pass and found some opportunities for improvement...

integrations/paddleocr/pydoc/config.yml

.github/workflows/paddleocr.yml

integrations/paddleocr/pyproject.toml

...src/haystack_integrations/components/converters/paddleocr/paddleocr_vl_document_converter.py

integrations/paddleocr/tests/test_paddleocr_vl_document_converter.py

anakin87 · 2025-12-02T10:49:45Z

...src/haystack_integrations/components/converters/paddleocr/paddleocr_vl_document_converter.py

+
+
+@component
+class PaddleOCRVLDocumentConverter:


Could you explain why the name is PaddleOCRVLDocumentConverter instead of a more generic PaddleOCRDocumentConverter? Is this component tied to a specific model? Could it be used with other models?

Bobholamovic · 2025-12-03

PaddleOCR offers two document-parsing solutions: PP-StructureV3 and PaddleOCR-VL, and the interfaces are slightly different. The PaddleOCRVLDocumentConverter uses the PaddleOCR-VL algorithm internally, which is why it is named this way. We planned to add a PPStructureV3DocumentConverter in a future PR. Would you recommend using a single all-in-one class that supports the different PaddleOCR algorithms as backends, or separate classes as we originally planned?

anakin87 · 2025-12-03

I don't know the differences in terms of interface. I would just prefer not to have almost identical components, but I trust you on this aspect. Please provide more information if possible.

Bobholamovic · 2025-12-03

If we define two separate converters, I expect the following differences between them:

The optional parameters accepted by their __init__ methods differ significantly. You can see the differences in the supported arguments for the two algorithms here and here.

Although the run method of each converter outputs raw_paddleocr_responses with a broadly similar structure, the exact fields are not identical, and there are notable differences.
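The trade-off discussed in this thread can be sketched in plain Python. This is only an illustration under stated assumptions: the class names, the `backend` values, and the option names below are placeholders chosen for the example, not the PR's actual API, and no Haystack or PaddleOCR code is involved.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


def _drop_unset(options: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the options the user explicitly set."""
    return {key: value for key, value in options.items() if value is not None}


# Option A: one class per algorithm. Each __init__ exposes only the
# parameters that its algorithm understands (names here are placeholders).
@dataclass
class VLConverterSketch:
    use_layout_detection: Optional[bool] = None

    def request_options(self) -> Dict[str, Any]:
        return _drop_unset({"use_layout_detection": self.use_layout_detection})


@dataclass
class StructureConverterSketch:
    use_table_recognition: Optional[bool] = None

    def request_options(self) -> Dict[str, Any]:
        return _drop_unset({"use_table_recognition": self.use_table_recognition})


# Option B: a single class with a backend switch. Options become a loosely
# typed dict that has to be validated at runtime for the chosen backend.
ALLOWED_OPTIONS = {
    "vl": {"use_layout_detection"},
    "structure": {"use_table_recognition"},
}


@dataclass
class UnifiedConverterSketch:
    backend: str
    options: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.backend not in ALLOWED_OPTIONS:
            raise ValueError(f"Unknown backend: {self.backend}")
        unknown = set(self.options) - ALLOWED_OPTIONS[self.backend]
        if unknown:
            raise ValueError(f"Options not supported by {self.backend}: {sorted(unknown)}")

    def request_options(self) -> Dict[str, Any]:
        return _drop_unset(dict(self.options))
```

With separate classes, unsupported parameters are rejected by the signature itself; with a single class, the same check has to be written and maintained by hand, which grows more costly as the two parameter sets diverge.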

Add PaddleOCR-VL document converter

4bcc397

Bobholamovic requested a review from a team as a code owner November 26, 2025 12:33

Bobholamovic requested review from mpangrazzi and removed request for a team November 26, 2025 12:33

github-actions bot added the type:documentation Improvements or additions to documentation label Nov 26, 2025

Bobholamovic added 2 commits November 26, 2025 20:52

Fix image extensions

95698ee

Add type ignore comment

d3e6020

anakin87 self-requested a review November 26, 2025 13:43

anakin87 mentioned this pull request Nov 26, 2025

Paddle OCR Integration #2569

Open

9 tasks

anakin87 reviewed Nov 26, 2025

Add CI workflow

e8da287

github-actions bot added the topic:CI label Nov 27, 2025

Update paddleocr and paddlex version

c9bbd7b

anakin87 requested changes Dec 2, 2025

Revise according to review comments

356bf4a

anakin87 reviewed Dec 3, 2025

Bobholamovic added 2 commits December 3, 2025 18:14

Remove unused code

85e66bb

Fix code style

5bbbb19
