Convert Markdown document incorrect #623

Closed
@kime541200

Description

Bug

Converting a Markdown document that contains nested lists drops the nested list items from the output.
...

Steps to reproduce

The original Markdown document looks something like this:

# ABCDEFG
- abc:
	- abc123:
		- abc1234:
			- abc12345:
				- a.
				- b.
		- abcd1234:
			- abcd12345:
				- a.
				- b.
- def:
	- def1234:
		- def12345。
- ghijkl

Here's the conversion process:

$ docling --from md --to md -vv /data/doc/test2.md
DEBUG:docling.backend.md_backend:MD INIT!!!
DEBUG:docling.backend.md_backend:# ABCDEFG

- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document test2.md
DEBUG:docling.backend.md_backend:converting Markdown...
DEBUG:docling.backend.md_backend:Some other element: <Document children=[<Heading children=[<RawText children='ABCDEFG'>]>,
 <BlankLine children=[]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc123:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc12345:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='abcd1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abcd12345:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>]>]>]>]>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='def:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='def1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='def12345.'>]>]>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='ghijkl'>]>]>]>]>
DEBUG:docling.backend.md_backend: - Heading level 1, content: ABCDEFG
DEBUG:docling.backend.md_backend:Some other element: <BlankLine children=[]>
DEBUG:docling.backend.md_backend: - List unordered
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
INFO:docling.document_converter:Finished converting document test2.md in 2.19 sec.
INFO:docling.cli.main:writing Markdown output to test2.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 2.19 seconds.
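
The parsed tree in the log contains every nested item, yet only three "List item" lines are logged, which suggests that only the top-level items are being visited. As an illustration of the traversal that would keep the nested items, here is a minimal sketch. `Node` and `render` are hypothetical names made up for this example; they are not the parser's or Docling's real classes, and this is not the backend's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed list item (not a real parser class)."""
    text: str
    children: list["Node"] = field(default_factory=list)


def render(items, depth=0, indent="  "):
    """Render list items depth-first so that nested items are kept."""
    lines = []
    for item in items:
        lines.append(f"{indent * depth}- {item.text}")
        # Recurse into the nested list instead of stopping at the top level.
        lines.extend(render(item.children, depth + 1, indent))
    return lines


# A shortened version of the list from this report.
tree = [
    Node("abc:", [Node("abc123:", [Node("abc1234:")])]),
    Node("def:", [Node("def1234:", [Node("def12345.")])]),
    Node("ghijkl"),
]
print("\n".join(render(tree)))
```

With a walk like this, all seven items of the shortened tree are emitted rather than just the three top-level ones.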

And here's the final result I got:

$ cat test2.md
# ABCDEFG

- abc:
- def:
- ghijkl
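
To make the loss easy to check, the bullet items of the input and the output can be compared. The following is a rough standard-library sketch; `list_items` and `missing_items` are helper names invented for this example, and the regular expression only handles simple `-`, `*` and `+` bullets.

```python
import re

# Matches a simple bullet line and captures its text, ignoring indentation.
LIST_ITEM = re.compile(r"^\s*[-*+]\s+(.*\S)\s*$")


def list_items(markdown):
    """Return the text of every bullet item, regardless of nesting depth."""
    return [m.group(1) for line in markdown.splitlines() if (m := LIST_ITEM.match(line))]


def missing_items(source, converted):
    """Return items present in the source but absent from the converted output."""
    remaining = list_items(converted)
    missing = []
    for item in list_items(source):
        if item in remaining:
            remaining.remove(item)  # match each output item at most once
        else:
            missing.append(item)
    return missing


# Shortened versions of the input and output shown in this report.
source = "# ABCDEFG\n- abc:\n\t- abc123:\n\t\t- abc1234:\n- def:\n\t- def1234:\n- ghijkl\n"
converted = "# ABCDEFG\n\n- abc:\n- def:\n- ghijkl\n"
print(missing_items(source, converted))  # ['abc123:', 'abc1234:', 'def1234:']
```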

I also tried converting this document with the Python library, but I got the same output.
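
For reference, the Python route presumably looks something like the sketch below. It assumes Docling's `DocumentConverter` entry point and the `export_to_markdown()` method on the converted document; `convert_to_markdown` is a helper name introduced here for illustration, and the exact script used in this report is not shown.

```python
from pathlib import Path


def convert_to_markdown(path):
    """Convert a document with Docling and return it as Markdown text."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(Path(path))
    return result.document.export_to_markdown()
```

Calling `convert_to_markdown("/data/doc/test2.md")` would then be the library counterpart of the CLI command above.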

In the final result, much of the content is missing from the output. Did I do anything wrong?

PS: I know that inputting and outputting Markdown might be unnecessary, but in my application scenario, I'm not sure in what format users will provide their content. I need to be able to convert various content formats into Markdown.

Docling version

$ docling --version
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

$ python --version
Python 3.11.10

Metadata

Labels

bug (Something isn't working), markdown (issue related to markdown backend)
