refactor: add the contentlayer to html-backend #1040

PeterStaar-IBM · 2025-02-23T06:02:36Z

New feature

Adding the content layer to html_backend in order to remove the furniture of websites.

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

mergify · 2025-02-23T06:03:10Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
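
As a rough illustration of the condition above, the same pattern can be tried against the two titles this PR has carried. This is a sketch that assumes Python's `re` module behaves like the matcher behind `~=` for this pattern; `is_conventional` is a hypothetical helper name, not part of Mergify.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


# The original and the renamed title of this PR.
print(is_conventional("Added the contentlayer to html-backend"))
print(is_conventional("refactor: add the contentlayer to html-backend"))
```

The optional `(?:\(.+\))?` group is what allows a scope, as in `test(html): ...`.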

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

docling/backend/html_backend.py

ceberam

It would also be necessary to rebase onto main to resolve the merge conflict and to pick up the latest commit, which includes another doc.add_text(...) call. Please also ensure a conventional commit title for this PR.

Another concern: this code will put everything into furniture until the first <h1> tag appears (if there is one). That may be too strong an assumption. For instance, the example reported in issue #1019 has no <h1> tag.
Instead, we could put everything in the body container (we only parse the <body> tag anyway) except specific tags like footnote or aside (which are not yet supported by the backend anyway).

Signed-off-by: Peter Staar <[email protected]>

Signed-off-by: Cesar Berrospi Ramis <[email protected]>

In case an HTML document does not have any header tag, all parsed items are placed in DoclingDocument's body content layer. HTML paragraphs ('p' tags) are parsed as text items with paragraph label. Update test ground truth according to the changes above.

Signed-off-by: Cesar Berrospi Ramis <[email protected]>

ceberam · 2025-02-28T17:18:22Z

@PeterStaar-IBM I have made some changes on branch dev/use-furniture-in-html for this PR:

rebased on the latest main and force-pushed the amended commits
added an additional commit, since:
- the code in this PR was putting all the content before the first header into furniture, but many documents may not have any header. This could also create confusion between the semantics of the body tag in HTML and docling's body content layer. What I did is put all the parsed content in the body content layer if the HTML document has no header tag.
- I reverted to the DocItemLabel.PARAGRAPH label for docling paragraphs (instead of DocItemLabel.TEXT), since I assumed the latter was a mistake.
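
The rule in the first bullet can be sketched roughly as below. This is not the code from the PR: it uses Python's standard `html.parser` rather than the backend's own parsing, and `Layer`, `assign_layers`, and the two helper classes are hypothetical names. It also assumes that a header tag of any level (h1 to h6) marks the switch to the body layer.

```python
from enum import Enum
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class Layer(str, Enum):
    # Hypothetical stand-in for docling's content layers.
    BODY = "body"
    FURNITURE = "furniture"


class _HeaderProbe(HTMLParser):
    """First pass: record whether any header tag occurs at all."""

    def __init__(self):
        super().__init__()
        self.has_header = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self.has_header = True


class _LayerAssigner(HTMLParser):
    """Second pass: tag each non-empty text chunk with a layer."""

    def __init__(self, has_header):
        super().__init__()
        # Without any header, everything goes to the body layer.
        self.layer = Layer.FURNITURE if has_header else Layer.BODY
        self.items = []

    def handle_starttag(self, tag, attrs):
        # From the first header onwards, content belongs to the body.
        if tag in HEADER_TAGS:
            self.layer = Layer.BODY

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((text, self.layer))


def assign_layers(html):
    """Return (text, layer) pairs for the text chunks in the HTML."""
    probe = _HeaderProbe()
    probe.feed(html)
    probe.close()
    assigner = _LayerAssigner(probe.has_header)
    assigner.feed(html)
    assigner.close()
    return assigner.items
```

With this sketch, `assign_layers("<body><p>Menu</p><h1>Title</h1><p>Text</p></body>")` marks only "Menu" as furniture, while a document without any header tag ends up entirely in the body layer.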

Signed-off-by: Cesar Berrospi Ramis <[email protected]>

PeterStaar-IBM self-assigned this Feb 23, 2025

PeterStaar-IBM marked this pull request as ready for review February 24, 2025 11:37

ceberam reviewed Feb 24, 2025

ceberam requested changes Feb 24, 2025

PeterStaar-IBM and others added 4 commits February 28, 2025 15:44

added the contentlayer to html-backend

252bd83

Signed-off-by: Peter Staar <[email protected]>

updated the handle_image function

e5e0067

Signed-off-by: Peter Staar <[email protected]>

reformatted code of html backend

0cba30e

Signed-off-by: Peter Staar <[email protected]>

test(html): add more info if a test case fails

70e6b94

Signed-off-by: Cesar Berrospi Ramis <[email protected]>

ceberam force-pushed the dev/use-furniture-in-html branch from e96ed30 to 70e6b94 Compare February 28, 2025 15:06

ceberam changed the title ~~Added the contentlayer to html-backend~~ refactor: add the contentlayer to html-backend Feb 28, 2025

chore: set TextItem label to 'text' instead of 'paragraph'

3ea31b6

Signed-off-by: Cesar Berrospi Ramis <[email protected]>

ceberam force-pushed the dev/use-furniture-in-html branch from 93d380b to 3ea31b6 Compare March 1, 2025 14:30

ceberam approved these changes Mar 1, 2025

PeterStaar-IBM requested review from cau-git and dolfim-ibm March 2, 2025 15:36

cau-git approved these changes Mar 2, 2025

PeterStaar-IBM merged commit e25d557 into main Mar 2, 2025
10 checks passed

PeterStaar-IBM deleted the dev/use-furniture-in-html branch March 2, 2025 15:37
