Update Docling version and improve OCR options handling with new docling ver. #574
…th bumped ver. Signed-off-by: eshwarprasadS <[email protected]>
Signed-off-by: eshwarprasadS <[email protected]>
Signed-off-by: eshwarprasadS <[email protected]>
Thanks @eshwarprasadS
I cloned this and ran it locally with some markdowns that were erroring out previously due to a Docling bug around unescaped headings. With this updated Docling version, those markdowns are now chunking properly.
Also, I looked at the change to the chunking test and it looks reasonable.
The only thing I'd ask, which could be done as a follow-up PR, is that we add a simplified example of markdown that failed with our previous docling version and that will pass with this new docling version. That's just to prevent regression here, but we don't have to hold up merging this PR itself for that unless that's quick and easy.
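A regression check along those lines could be sketched roughly as follows. Everything here is an assumption for illustration: the sample markdown is a made-up placeholder (not the input that actually failed), and `assert_markdown_chunks`, `chunk_fn`, and `naive_chunker` are hypothetical names standing in for whatever chunking entry point the project exposes.

```python
from typing import Callable, List

# Hypothetical sample with special characters left unescaped in headings.
# This is a placeholder, NOT the input from the original bug report.
SAMPLE_MARKDOWN = """\
# Overview of foo_bar & <baz>

Some introductory text.

## Details: 100% of *cases*

More text under the second heading.
"""


def assert_markdown_chunks(
    chunk_fn: Callable[[str], List[str]], text: str = SAMPLE_MARKDOWN
) -> List[str]:
    """Run a chunker over the sample and check basic invariants.

    chunk_fn is a hypothetical adapter: any callable that takes markdown
    text and returns a list of chunk strings.
    """
    chunks = chunk_fn(text)
    assert chunks, "chunker returned no chunks"
    assert all(isinstance(c, str) and c.strip() for c in chunks), "empty chunk"
    joined = "\n".join(chunks)
    # Body text must survive chunking regardless of how headings are handled.
    for phrase in ("Some introductory text.", "More text under the second heading."):
        assert phrase in joined, f"missing text: {phrase!r}"
    return chunks


def naive_chunker(text: str) -> List[str]:
    """Stand-in chunker (blank-line split) so the sketch runs without docling."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


chunks = assert_markdown_chunks(naive_chunker)
```

In a real test, `naive_chunker` would be replaced by a thin wrapper around the project's own chunker, and the sample by a reduced version of a document that failed on the old docling version.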
Signed-off-by: eshwarprasadS <[email protected]>
Looks great, just a few nits, but I won't block on these. Thanks!
Signed-off-by: eshwarprasadS <[email protected]>
20eb2c1 to 0ba442a
Signed-off-by: Khaled Sulayman <[email protected]>
0ba442a to 0cafab8
@khaledsulayman noticed e2e fails here. Thanks @courtneypacheco for looking into this. Since @eshwarprasadS created this PR from his fork, GitHub will not take the changes to .github/workflows/e2e-nvidia-t4-x1.yml into account. To get the e2e tests to run with proper credentials on this PR's changes:
Then you should be able to run the CI changes in this PR before merging to
We should be able to remove the constraints.txt and workflow changes here. There was a bug in Python SetupTools 77.0.3 with DeepSpeed that is now resolved in a newer Python SetupTools, which our most recent CI builds are picking up. See deepspeedai/DeepSpeed#7165 for other reports of this, but I've since seen our CI pass as the jobs are now picking up SetupTools 78.x.
Signed-off-by: eshwarprasadS <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Eshwar Prasad Sivaramakrishnan <[email protected]>
@Mergifyio backport release-v0.7
✅ Backports have been created
Update Docling version and improve OCR options handling with new docling ver. (backport #574)
This PR bumps the version of docling from docling>=2.4.2,<=2.8.3 to docling>=2.18.0. This brings in the fix for the docling chunking failure on markdowns with unescaped special characters (docling-project/docling#823).
The primary changes are to requirements.txt, tox.ini, and chunkers.py, and to docling.parse in taxonomy.py, since the PDF-parsed doc content does not need to be passed to DocumentChunker.
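To make the effect of the constraint change concrete, here is a minimal sketch that checks a few version strings against the old and new ranges. It is a simplification that only handles plain dotted numeric versions; the helper names are hypothetical, and pip's actual resolution follows the full version-specifier rules rather than this code.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.18.0' into (2, 18, 0). Plain numeric versions only (assumption)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_old_pin(version: str) -> bool:
    """Old constraint: docling>=2.4.2,<=2.8.3."""
    return (2, 4, 2) <= parse_version(version) <= (2, 8, 3)


def satisfies_new_pin(version: str) -> bool:
    """New constraint: docling>=2.18.0."""
    return parse_version(version) >= (2, 18, 0)


for candidate in ("2.4.2", "2.8.3", "2.18.0", "2.20.1"):
    print(candidate, satisfies_old_pin(candidate), satisfies_new_pin(candidate))
```

Comparing integer tuples matters here: as plain strings, "2.18.0" sorts before "2.8.3", which would wrongly place the newer release inside the old range.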