Markdown parser only considers first child of ListItem #913

Open
vagenas opened this issue Feb 7, 2025 · 4 comments · May be fixed by #1039
Assignees: vagenas
Labels: bug (Something isn't working), markdown (issue related to markdown backend)

Comments

@vagenas
Contributor

vagenas commented Feb 7, 2025

Issue originally posted by @Heremeus in #623:

A similar issue happens with inline code wrapped in backticks (`).

Converting the following Markdown file and exporting it back to Markdown with DocumentConverter().convert("file.md").document.export_to_markdown() causes docling to cut off each list item's text at the first backtick.

Input:

# Contributing

1. Pull the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Exported Markdown:

 # Contributing

- Pull the repository
- Create your feature branch (
- Commit your changes (
- Push to the branch (
- Open a Pull Request

Also, from their follow-up comment:

I did a little debugging, and this seems to stem from the md_backend. The handling for marko.block.ListItem considers only the first child, ignoring any other children of the ListItem.

snippet_text = str(element.children[0].children[0].children)

In my example above, element.children[0] is a Paragraph containing multiple RawText and CodeSpan children. element.children[0].children[0] only uses the first RawText child and ignores the rest of the Paragraph.
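
The behaviour described above can be reproduced without docling or marko. The following is a minimal sketch, not the backend's actual code: `Node` is a hypothetical stand-in that models only the `children` attribute (a string for leaf nodes, a list for containers), and `collect_text` is an illustrative helper showing one way to visit every child instead of only the first.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical stand-in for the parser's element classes; only the
# `children` attribute is modelled (str for leaves, list for containers).
@dataclass
class Node:
    kind: str
    children: Union[str, List["Node"]]


def collect_text(node: Node) -> str:
    """Concatenate the text of every descendant, not just the first child."""
    if isinstance(node.children, str):
        # Re-wrap code spans in backticks so inline code survives.
        if node.kind == "CodeSpan":
            return f"`{node.children}`"
        return node.children
    return "".join(collect_text(child) for child in node.children)


# "Create your feature branch (`git checkout -b feature/AmazingFeature`)"
item = Node("ListItem", [
    Node("Paragraph", [
        Node("RawText", "Create your feature branch ("),
        Node("CodeSpan", "git checkout -b feature/AmazingFeature"),
        Node("RawText", ")"),
    ])
])

# The expression quoted above keeps only the first RawText child.
first_child_only = str(item.children[0].children[0].children)

# Walking all children recovers the whole list item.
full_text = collect_text(item)

print(first_child_only)
print(full_text)
```

The first value matches the truncated list item in the exported Markdown, while the recursive walk keeps the code span and the closing parenthesis.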

@vagenas vagenas added bug Something isn't working markdown issue related to markdown backend labels Feb 7, 2025
@vagenas vagenas self-assigned this Feb 7, 2025
@ruizguille

A similar issue is happening if list items include bold or italic text.

In particular, for an input like this:

# Example

1. **Item 1 bold**: Item 1 description
2. **Item 2 bold**: Item 2 description
3. **Item 3 bold**: Item 3 description

The exported Markdown is:

# Example

- [<RawText children='Item 1 bold'>]
- [<RawText children='Item 2 bold'>]
- [<RawText children='Item 3 bold'>]
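
The bracketed output above is consistent with the same first-child expression being applied to a bold item: the first child of the Paragraph is then a container rather than a leaf, so its `children` is a list and `str()` returns the list's repr. A small sketch of that effect, assuming hypothetical stand-in classes whose `__repr__` is modelled on the output shown here rather than taken from the library:

```python
class RawText:
    """Hypothetical leaf node; the repr format is copied from the output above."""

    def __init__(self, children: str):
        self.children = children

    def __repr__(self) -> str:
        return f"<RawText children={self.children!r}>"


class Container:
    """Hypothetical container node (ListItem, Paragraph, or a bold span)."""

    def __init__(self, children: list):
        self.children = children


# "**Item 1 bold**: Item 1 description"
bold = Container([RawText("Item 1 bold")])
paragraph = Container([bold, RawText(": Item 1 description")])
item = Container([paragraph])

# The first child of the paragraph is the bold span, whose children is a
# list, so str() yields the list repr instead of the text.
snippet_text = str(item.children[0].children[0].children)
print(snippet_text)
```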

@tobiasstrebitzer tobiasstrebitzer linked a pull request Feb 23, 2025 that will close this issue
@PeterStaar-IBM
Contributor

@vagenas Can we consider this closed? I think you fixed this now, no?

@vagenas
Contributor Author

vagenas commented Feb 28, 2025

@PeterStaar-IBM No, because this is about the Markdown backend (not the export). My comment here is related.

@PeterStaar-IBM
Contributor

@vagenas Got it, thanks!
