Update Docling version and improve OCR options handling with new docling ver. #574

eshwarprasadS · 2025-03-18T21:03:33Z

This PR bumps docling from docling>=2.4.2,<=2.8.3 to docling>=2.18.0. The bump brings in the fix for the docling chunking failure on Markdown files with unescaped special characters (docling-project/docling#823).
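To make the effect of the pin change concrete, here is a minimal sketch that checks version strings against the old and new ranges. The helper names (`parse`, `in_old_range`, `in_new_range`) are made up for illustration, and the parsing is deliberately simplified: it handles plain `X.Y.Z` versions only, whereas pip applies the full PEP 440 rules.

```python
def parse(version: str) -> tuple:
    """Turn '2.18.0' into (2, 18, 0) so components compare numerically."""
    return tuple(int(part) for part in version.split("."))


def in_old_range(version: str) -> bool:
    # Old pin: docling>=2.4.2,<=2.8.3
    return parse("2.4.2") <= parse(version) <= parse("2.8.3")


def in_new_range(version: str) -> bool:
    # New pin: docling>=2.18.0 (no upper bound)
    return parse(version) >= parse("2.18.0")


if __name__ == "__main__":
    for candidate in ("2.4.2", "2.8.3", "2.18.0"):
        print(candidate, in_old_range(candidate), in_new_range(candidate))
```

Comparing the raw strings would give the wrong answer here, because `"2.18.0" < "2.8.3"` is true lexicographically. That is why the sketch converts each component to an integer first.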

The primary changes are:

Updates to requirements.txt
Updates to CI environment handling in tox.ini and chunkers.py
Removal of the legacy pattern that used bare docling.parse in taxonomy.py, since the parsed PDF content does not need to be passed to DocumentChunker
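The title mentions OCR options handling, and a resolver of that kind may legitimately find no usable engine (for example in a CI environment without OCR packages installed). The following is a minimal sketch of that pattern only. The function name `pick_ocr_backend` and the candidate module names are hypothetical, and this is not the code in chunkers.py.

```python
import importlib.util
from typing import Optional, Sequence


def pick_ocr_backend(
    candidates: Sequence[str] = ("pytesseract", "easyocr"),  # hypothetical defaults
) -> Optional[str]:
    """Return the first importable top-level module name, or None if none is installed."""
    for name in candidates:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(name) is not None:
            return name
    # No OCR engine available: the caller is expected to proceed without OCR
    return None


if __name__ == "__main__":
    backend = pick_ocr_backend()
    if backend is None:
        print("No OCR backend found; continuing without OCR")
    else:
        print(f"Using OCR backend: {backend}")
```

Returning `None` instead of raising lets the caller decide whether a missing OCR engine is fatal, and the `Optional[str]` annotation makes that case visible to type checkers.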

aakankshaduggal

Thanks @eshwarprasadS

src/instructlab/sdg/utils/taxonomy.py

bbrowning

I cloned this and ran it locally with some markdowns that were erroring out previously due to a Docling bug around unescaped headings. With this updated Docling version, those markdowns are now chunking properly.

Also, I looked at the change to the chunking test and it looks reasonable.

The only thing I'd ask, which could be done as a follow-up PR, is that we add a simplified example of markdown that failed with our previous docling version and that will pass with this new docling version. That's just to prevent regression here, but we don't have to hold up merging this PR itself for that unless that's quick and easy.

khaledsulayman

Looks great. Just a few nits, but I won't block on these. Thanks!

src/instructlab/sdg/utils/taxonomy.py

ktdreyer · 2025-03-24T14:30:51Z

@khaledsulayman noticed e2e fails here. Thanks @courtneypacheco for looking into this.

Since @eshwarprasadS created this PR from his fork, GitHub will not take the changes to .github/workflows/e2e-nvidia-t4-x1.yml into account.

To get the e2e tests to run with proper credentials on this PR's changes:

Create a new work-in-progress branch within this repo (not a fork). The wip branch should be based on main. You can name it docling-version-bump.

Merge this PR's contents to docling-version-bump.
Open a new PR from docling-version-bump to main.

Then you should be able to run the CI changes in this PR before merging to main.

bbrowning · 2025-03-25T20:38:18Z

We should be able to remove the constraints.txt and workflow changes here, as there was a bug in Python SetupTools 77.0.3 with DeepSpeed that is now resolved with a newer Python SetupTools that our most recent CI builds are picking up. See deepspeedai/DeepSpeed#7165 for other reports of this, but I've since seen our CI pass as the jobs are now picking up SetupTools 78.x.
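For anyone who wants to confirm which SetupTools version an environment actually picked up, here is a minimal sketch using only the standard library. The helper names and the `KNOWN_BAD` set are illustrative assumptions. The set contains only the one version named in the comment above, not a complete list of affected releases.

```python
from importlib import metadata
from typing import Optional

# Assumption: only the version named in the comment above is treated as bad.
KNOWN_BAD = {"77.0.3"}


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def is_known_bad(version: Optional[str]) -> bool:
    """True only when the version string is in the KNOWN_BAD set."""
    return version in KNOWN_BAD


if __name__ == "__main__":
    found = installed_version("setuptools")
    print(f"setuptools: {found}, known bad: {is_known_bad(found)}")
```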

mergify · 2025-03-26T17:58:27Z

This pull request has merge conflicts that must be resolved before it can be
merged. @eshwarprasadS please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

bbrowning · 2025-03-31T17:26:07Z

@Mergifyio backport release-v0.7

mergify · 2025-03-31T17:26:14Z

backport release-v0.7

✅ Backports have been created

#583 Update Docling version and improve OCR options handling with new docling ver. (backport #574) has been created for branch release-v0.7

feat: update docling requirements and improve OCR options handling wi…

84f4c7f

…th bumped ver. Signed-off-by: eshwarprasadS <[email protected]>

eshwarprasadS requested review from bbrowning, aakankshaduggal and khaledsulayman March 18, 2025 21:03

mergify bot added the CI/CD, testing, ci-failure, and dependencies labels Mar 18, 2025

chore: linting..

7f18d28

Signed-off-by: eshwarprasadS <[email protected]>

mergify bot added ci-failure and removed ci-failure labels Mar 18, 2025

fix: make return type Optional for resolve_ocr_options

30a3310

Signed-off-by: eshwarprasadS <[email protected]>

mergify bot removed the ci-failure label Mar 18, 2025

aakankshaduggal approved these changes Mar 19, 2025

mergify bot added the one-approval label Mar 19, 2025

courtneypacheco reviewed Mar 19, 2025

src/instructlab/sdg/utils/taxonomy.py (resolved)

bbrowning approved these changes Mar 19, 2025

mergify bot removed the one-approval label Mar 19, 2025

test: Add unit tests for _get_documents() method behaviors

246e71c

Signed-off-by: eshwarprasadS <[email protected]>

eshwarprasadS requested a review from courtneypacheco March 19, 2025 19:14

khaledsulayman reviewed Mar 19, 2025

src/instructlab/sdg/utils/taxonomy.py (outdated, resolved)

src/instructlab/sdg/utils/taxonomy.py (resolved)

ktdreyer mentioned this pull request Mar 20, 2025

Bump minimum version of SDG to 0.7.2 instructlab/instructlab#3238

Merged

chore: return type, comments

8f9910e

Signed-off-by: eshwarprasadS <[email protected]>

mergify bot added ci-failure and removed ci-failure labels Mar 20, 2025

khaledsulayman approved these changes Mar 21, 2025

mergify bot added ci-failure and removed ci-failure labels Mar 21, 2025

khaledsulayman force-pushed the docling-version-bump branch from 20eb2c1 to 0ba442a Compare March 21, 2025 19:23

mergify bot added ci-failure and removed ci-failure labels Mar 21, 2025

Update setuptools dependency

0cafab8

Signed-off-by: Khaled Sulayman <[email protected]>

khaledsulayman force-pushed the docling-version-bump branch from 0ba442a to 0cafab8 Compare March 21, 2025 19:58

mergify bot added ci-failure and removed ci-failure labels Mar 21, 2025

fix: delete constriants, upgrade setuptools ver, CI fix

9074bc3

Signed-off-by: eshwarprasadS <[email protected]>

mergify bot added needs-rebase and removed ci-failure labels Mar 26, 2025

Merge branch 'main' into docling-version-bump

e0c469d

Signed-off-by: Eshwar Prasad Sivaramakrishnan <[email protected]>

mergify bot removed the needs-rebase label Mar 26, 2025

eshwarprasadS merged commit 2cc9889 into instructlab:main Mar 26, 2025
28 checks passed

mergify bot mentioned this pull request Mar 31, 2025

Update Docling version and improve OCR options handling with new docling ver. (backport #574) #583

Merged

bbrowning added a commit that referenced this pull request Mar 31, 2025

Merge pull request #583 from instructlab/mergify/bp/release-v0.7/pr-574

c535560

Update Docling version and improve OCR options handling with new docling ver. (backport #574)
