Update Docling version and improve OCR options handling with new docling ver. #574

Merged: 8 commits into instructlab:main on Mar 26, 2025

eshwarprasadS
Contributor

This PR bumps the docling requirement from docling>=2.4.2,<=2.8.3 to docling>=2.18.0. The new version brings in the fix for a docling chunking failure on markdown files with unescaped special characters (docling-project/docling#823).

The primary changes are:

  • Updates to requirements.txt
  • Updates to CI environment handling in tox.ini and chunkers.py
  • Removal of legacy patterns that use bare docling.parse in taxonomy.py, since the parsed PDF document content does not need to be passed to DocumentChunker
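
To illustrate what the requirement change means in practice, here is a minimal sketch that checks a version string against the old and new docling ranges. It is not how pip resolves requirements: the helper names are hypothetical, and it assumes plain three-part release strings such as `2.18.0` (no pre-release or local version tags).

```python
import re


def parse_version(text: str) -> tuple:
    """Turn a plain release string such as '2.18.0' into (2, 18, 0)."""
    if not re.fullmatch(r"\d+(\.\d+)*", text):
        raise ValueError(f"unsupported version string: {text!r}")
    return tuple(int(part) for part in text.split("."))


def in_old_range(version: str) -> bool:
    """Old requirement from the PR description: docling>=2.4.2,<=2.8.3."""
    return parse_version("2.4.2") <= parse_version(version) <= parse_version("2.8.3")


def in_new_range(version: str) -> bool:
    """New requirement from the PR description: docling>=2.18.0."""
    return parse_version(version) >= parse_version("2.18.0")


# Print which range each candidate release falls into.
for candidate in ("2.8.3", "2.18.0"):
    print(candidate, "old:", in_old_range(candidate), "new:", in_new_range(candidate))
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.18.0" sorts before "2.8.3", while numerically 2.18.0 is the newer release.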

@mergify mergify bot added the CI/CD, testing, ci-failure, and dependencies labels Mar 18, 2025
Signed-off-by: eshwarprasadS <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 18, 2025
@mergify mergify bot removed the ci-failure label Mar 18, 2025
Member

@aakankshaduggal aakankshaduggal left a comment

@mergify mergify bot added the one-approval label Mar 19, 2025
Contributor

@bbrowning bbrowning left a comment

I cloned this and ran it locally with some markdowns that were erroring out previously due to a Docling bug around unescaped headings. With this updated Docling version, those markdowns are now chunking properly.

Also, I looked at the change to the chunking test and it looks reasonable.

The only thing I'd ask, which could be done as a follow-up PR, is that we add a simplified example of markdown that failed with our previous docling version and that will pass with this new docling version. That's just to prevent regression here, but we don't have to hold up merging this PR itself for that unless that's quick and easy.
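
One possible shape for such a regression check is sketched below. Everything in it is an assumption for illustration: the sample markdown is made up and is not the input reported in docling-project/docling#823, `check_markdown_chunks` is a hypothetical helper, and `naive_chunker` is a stand-in for the project's real chunking entry point, which a real test would pass in instead.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

# Hypothetical sample with unescaped special characters in headings and body.
# This is NOT the markdown from the upstream bug report.
SAMPLE_MARKDOWN = (
    "# Pricing & <Limits>\n\n"
    "Cost is $5 * n for n_items > 10.\n\n"
    "## Notes [draft]\n\n"
    "Use `a_b` and 50% more.\n"
)


def check_markdown_chunks(
    markdown: str, chunk_fn: Callable[[Path], List[str]]
) -> List[str]:
    """Write markdown to a temp file, chunk it, and check the chunks are usable."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.md"
        path.write_text(markdown, encoding="utf-8")
        chunks = chunk_fn(path)
    assert chunks, "chunker returned no chunks"
    assert all(chunk.strip() for chunk in chunks), "chunker returned an empty chunk"
    return chunks


def naive_chunker(path: Path) -> List[str]:
    """Stand-in chunker (assumption): splits the file on blank lines."""
    text = path.read_text(encoding="utf-8")
    return [block for block in text.split("\n\n") if block.strip()]


chunks = check_markdown_chunks(SAMPLE_MARKDOWN, naive_chunker)
print(f"{len(chunks)} chunks produced")
```

Passing the chunker in as a callable keeps the check independent of any particular docling version, so the same test body can run before and after a dependency bump.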

@mergify mergify bot removed the one-approval label Mar 19, 2025
Member

@khaledsulayman khaledsulayman left a comment

Looks great. Just a few nits, but I won't block on these. Thanks!

Signed-off-by: eshwarprasadS <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 20, 2025
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 21, 2025
Signed-off-by: Khaled Sulayman <[email protected]>
@ktdreyer
Contributor

@khaledsulayman noticed e2e fails here. Thanks @courtneypacheco for looking into this.

Since @eshwarprasadS created this PR from his fork, GitHub will not take the changes to .github/workflows/e2e-nvidia-t4-x1.yml into account.

To get the e2e tests to run with proper credentials on this PR's changes:

  1. Create a new work-in-progress branch within this repo (not a fork). The wip branch should be based on main. You can name it docling-version-bump.
  2. Merge this PR's contents to docling-version-bump.
  3. Open a new PR from docling-version-bump to main.

Then you should be able to run the CI changes in this PR before merging to main.
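
The branch steps above can be sketched as a short list of git commands. This is only one way to do it, under stated assumptions: the `origin` remote points at the main repository rather than a fork, the local branch name `pr-574` is made up for illustration, and `mirror_pr_commands` is a hypothetical helper. It relies on GitHub exposing each pull request under the `pull/<number>/head` ref. Opening the new PR itself still happens in the GitHub UI.

```python
import subprocess
from typing import List


def mirror_pr_commands(
    pr_number: int, branch: str, remote: str = "origin", base: str = "main"
) -> List[List[str]]:
    """Build the git commands that copy a fork PR onto an in-repo branch."""
    local_pr = f"pr-{pr_number}"  # hypothetical local name for the PR head
    return [
        ["git", "fetch", remote, base],
        # GitHub publishes every PR head under pull/<number>/head.
        ["git", "fetch", remote, f"pull/{pr_number}/head:{local_pr}"],
        ["git", "checkout", "-b", branch, f"{remote}/{base}"],
        ["git", "merge", "--no-edit", local_pr],
        ["git", "push", "--set-upstream", remote, branch],
    ]


def run_commands(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Dry run only: nothing is executed, the commands are just printed.
run_commands(mirror_pr_commands(574, "docling-version-bump"))
```

Keeping the default as a dry run lets a maintainer review the exact commands before anything is pushed to the shared repository.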

@bbrowning
Contributor

We should be able to remove the constraints.txt and workflow changes here. There was a bug in Python SetupTools 77.0.3 with DeepSpeed that is resolved in newer SetupTools releases, which our most recent CI builds are picking up. See deepspeedai/DeepSpeed#7165 for other reports of this. I've since seen our CI pass now that the jobs are picking up SetupTools 78.x.

Contributor

mergify bot commented Mar 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. @eshwarprasadS please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added needs-rebase and removed ci-failure labels Mar 26, 2025
Signed-off-by: Eshwar Prasad Sivaramakrishnan <[email protected]>
@mergify mergify bot removed the needs-rebase label Mar 26, 2025
@eshwarprasadS eshwarprasadS merged commit 2cc9889 into instructlab:main Mar 26, 2025
28 checks passed
@bbrowning
Contributor

@Mergifyio backport release-v0.7

Contributor

mergify bot commented Mar 31, 2025

backport release-v0.7

✅ Backports have been created

bbrowning added a commit that referenced this pull request Mar 31, 2025
Update Docling version and improve OCR options handling with new docling ver. (backport #574)