Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: add the contentlayer to html-backend #1040

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

PeterStaar-IBM
Copy link
Contributor

New feature

  1. Adding content-layer to html_backend in order to remove furniture of websites

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@PeterStaar-IBM PeterStaar-IBM self-assigned this Feb 23, 2025
Copy link

mergify bot commented Feb 23, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@PeterStaar-IBM PeterStaar-IBM marked this pull request as ready for review February 24, 2025 11:37

doc.add_picture(
parent=self.parents[self.level],
caption=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
caption=None,
caption=caption,

Comment on lines 207 to 262
label = DocItemLabel.PARAGRAPH
label = DocItemLabel.TEXT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using DocItemLabel.PARAGRAPH if we are parsing content enclosed with the HTML paragraph tag <p> ?

Copy link
Contributor

@ceberam ceberam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be necessary to rebase to main to resolve the merge conflict and to pick up the last commit, which includes another doc.add_text(...) call. And ensure a conventional commit title for this PR.

Another concern: this code will put everything to furniture before the first <h1> tag appears (if there is any). That may be a strong assumption. For instance, the example reported in the issue #1019 has no <h1> tag.
Instead, we could put everything in the body container (we only parse the <body> tag anyway) except specific tags like footnote or aside (which are not yet supported by the backend anyway).

@ceberam ceberam force-pushed the dev/use-furniture-in-html branch from e96ed30 to 70e6b94 Compare February 28, 2025 15:06
In case an HTML does not have any header tag, all parsed items are placed in
DoclingDocument's body content layer.
HTML paragraphs ('p' tags) are parsed as text items with paragraph label.
Update test ground truth accoring to the changes above.

Signed-off-by: Cesar Berrospi Ramis <[email protected]>
@ceberam ceberam changed the title Added the contentlayer to html-backend refactor: add the contentlayer to html-backend Feb 28, 2025
@ceberam
Copy link
Contributor

ceberam commented Feb 28, 2025

@PeterStaar-IBM I have done some changes on branch dev/use-furniture-in-html for this PR:

  • rebased on latest main and force-pushed the amended commits
  • add an additional commit since:
    • the code on this PR was putting all the content before the first header into furniture, but many documents may not have any header. This could also create confusion between the semantic of the body tag in HTML and the docling's body content layer. What I did is to put all the parsed content in the content layer body if the HTML document has no header tag.
    • I reverted the DocItemLabel.PARAGRAPH label for docling paragraphs (instead of DocItemLabel.TEXT) since I assumed the latter was a mistake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants