refactor: add the contentlayer to html-backend #1040
base: main
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates: This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
doc.add_picture(
    parent=self.parents[self.level],
    caption=None,
-    caption=None,
+    caption=caption,
docling/backend/html_backend.py
Outdated
- label = DocItemLabel.PARAGRAPH
+ label = DocItemLabel.TEXT
Why not use DocItemLabel.PARAGRAPH if we are parsing content enclosed in the HTML paragraph tag <p>?
It would also be necessary to rebase onto main to resolve the merge conflict and to pick up the last commit, which includes another doc.add_text(...) call. Please also ensure a conventional commit title for this PR.
Another concern: this code will put everything into furniture until the first <h1> tag appears (if there is one). That may be a strong assumption. For instance, the example reported in issue #1019 has no <h1> tag.
Instead, we could put everything in the body container (we only parse the <body> tag anyway) except specific tags like footnote or aside (which are not yet supported by the backend anyway).
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
e96ed30 to 70e6b94
If an HTML document does not have any header tag, all parsed items are placed in the DoclingDocument's body content layer. HTML paragraphs ('p' tags) are parsed as text items with the paragraph label. Update the test ground truth according to the changes above.
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
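For readers following the thread, here is a minimal sketch of the layer rule described in the commit message above. It is not the backend's actual code: the Layer enum and the assign_layers helper are hypothetical stand-ins, and the sketch assumes that "header tag" means <h1>, as in the review comment. Items before the first <h1> go to furniture, and when no <h1> exists everything stays in the body.

```python
from enum import Enum


class Layer(str, Enum):
    # Hypothetical stand-in for the content layers discussed in this PR.
    BODY = "body"
    FURNITURE = "furniture"


def assign_layers(tags: list[str]) -> list[Layer]:
    """Assign a content layer to each top-level tag name, in document order."""
    # Assumption: only <h1> counts as the tag that starts the body.
    if "h1" not in tags:
        # No header tag at all: keep every item in the body layer.
        return [Layer.BODY for _ in tags]
    first_h1 = tags.index("h1")
    # Items before the first <h1> are furniture; the <h1> and everything
    # after it belong to the body.
    return [
        Layer.FURNITURE if i < first_h1 else Layer.BODY
        for i in range(len(tags))
    ]
```

With this rule, a page like the one in issue #1019 (no <h1>) would yield only body items, while a page with navigation text ahead of its <h1> would have that leading text marked as furniture.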
@PeterStaar-IBM I have done some changes on branch
New feature
Checklist: