Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Improve markdown list parser #1039

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tobiasstrebitzer
Copy link
Contributor

Description of the changes:

  • Improve MarkdownDocumentBackend to set the correct list type to ordered or unordered
  • Improve MarkdownDocumentBackend to parse all child elements of a list item
  • Improve MarkdownDocumentBackend to fix exception when a list item doesn't contain any children.

Known limitations:

  • The markdown parser still needs work, I believe there's a vast amount of edge cases that are not currently handled by the parser.

Issue resolved by this Pull Request:

Resolves #913
Resolves #851

Checklist:

  • Documentation has not been updated (no changes).
  • Examples have not been added (not necessary).
  • Tests have been added and truth data updated (ordered list example).
  • Code style and formatting applied via pre-commit hook.

Signed-off-by: Tobias Strebitzer <[email protected]>
Copy link

mergify bot commented Feb 23, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@vagenas vagenas self-assigned this Feb 25, 2025
@vagenas
Copy link
Contributor

vagenas commented Feb 27, 2025

Hi @tobiasstrebitzer

Thanks for your PR.

For some recent context, with docling-core v2.21.0, we introduced a couple relevant improvements:

  1. inline groups, for capturing things like the inline code in #913, i.e. groups whose contents are to be printed in single line
  2. Markdown export logic is more structured

Now I see you were initially targeting both #851 & #913 with this PR.

  • #913 is IMO relatively straightforward
  • #851 on the other hand may still require some more minor conceptual work on our side to be properly addressed.

With all that in mind, and to keep PRs as local as possible, I propose focusing this PR only on #913 and solving that by using the newly introduced inline groups (example) as needed, i.e. differentiating between blocks and inline components (e.g. code block vs inline code) (assuming Marko makes this differentiation possible).

This would definitely be a nice feature to have!

Would you like to take it up since you already started working on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Markdown parser only considers first child of ListItem Export to markdown treats ordered lists as unordered
2 participants