Issue when extracting markdown from a PDF containing a table on a background #114
-
Please provide the example file / page. The verbal description alone is not enough.
-
Gives the following output:

|Start|1st|2nd|...|nth|Some multiline|Last|
|---|---|---|---|---|---|---|
|A|B|C|D|E|F|G|
|test||Next line|||Is Badly Aligned With Other Text||
||example|||J||Z|
This is a first paragraph easy to be extracted
Start 1st 2nd ... nth Some Last
multiline
A B C D E F G
test Next Is Badly
line Aligned
With
example J Z
Other
Text
-----
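
The scrambled lines after the paragraph read like table cells taken line by line across columns rather than cell by cell. Below is a minimal sketch of that effect, assuming made-up (x, y) positions for the fragments of the "test" row. The coordinates and helper names are hypothetical, and this is not pymupdf4llm's code:

```python
from itertools import groupby

# Hypothetical (x, y, text) positions for the text fragments of one table row.
# Fragments with the same x belong to the same cell; y grows downwards.
fragments = [
    (10, 100, "test"),
    (120, 100, "Next"),
    (120, 112, "line"),
    (300, 100, "Is Badly"),
    (300, 112, "Aligned"),
    (300, 124, "With"),
]


def lines_top_to_bottom(frags):
    """Read fragments line by line: sort by y, then x, and join each text line."""
    ordered = sorted(frags, key=lambda f: (f[1], f[0]))
    return [
        " ".join(text for _, _, text in group)
        for _, group in groupby(ordered, key=lambda f: f[1])
    ]


def cells_left_to_right(frags):
    """Read fragments cell by cell: sort by x, then y, and join each cell."""
    ordered = sorted(frags, key=lambda f: (f[0], f[1]))
    return [
        " ".join(text for _, _, text in group)
        for _, group in groupby(ordered, key=lambda f: f[0])
    ]


print(lines_top_to_bottom(fragments))
print(cells_left_to_right(fragments))
```

The line-wise reading mixes words from different cells on each line, much like the plain-text lines above, while the cell-wise reading keeps each cell's words together, as in the markdown table.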
-
This example does everything it can to confuse the algorithm:
The algorithm looks at all content on a page (text, vector graphics and images) and tries to make sense out of the text positions in relation to all content. There currently are choices to
So either you get what you are complaining about, or you get this output:

|Start|1st|2nd|...|nth|Some multiline|Last|
|---|---|---|---|---|---|---|
|A|B|C|D|E|F|G|
|test||Next line|||Is Badly Aligned With Other Text||
||example|||J||Z|

-----

when you execute

You can influence the output further if you remove images from a page, and only let text and vector graphics survive:

from pathlib import Path
import pymupdf
import pymupdf4llm

doc = pymupdf.open("from-ppt.pdf")
for page in doc:
    # remove any image on all pages
    page.add_redact_annot(page.rect)
    page.apply_redactions(
        images=pymupdf.PDF_REDACT_IMAGE_REMOVE,  # remove images
        graphics=pymupdf.PDF_REDACT_LINE_ART_NONE,  # don't touch vector graphics
        text=pymupdf.PDF_REDACT_TEXT_NONE,  # don't touch text
    )

# now make markdown extraction from image-free pages
md = pymupdf4llm.to_markdown(doc)
Path(doc.name.replace(".pdf", ".md")).write_bytes(md.encode())

This is the output:

This is a first paragraph easy to be extracted
|Start|1st|2nd|...|nth|Some multiline|Last|
|---|---|---|---|---|---|---|
|A|B|C|D|E|F|G|
|test||Next line|||Is Badly Aligned With Other Text||
||example|||J||Z|
-----
-
I have been debugging for a while and have difficulty understanding one point.
My PDF was created by exporting a PowerPoint pptx presentation in landscape layout.
Various slides contain tables that are correctly converted to markdown tables.
Unfortunately, the table sits on top of a background image / vector graphic, so the output_images method extracts the text in a reading order in which it loses all its meaning.
Why does output_images not exclude the table areas?
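
To make the question concrete, here is a minimal sketch of what "excluding the table areas" could mean: dropping every text block whose rectangle lies inside a detected table rectangle. The helper names, tolerance, and coordinates are hypothetical, and this is not how pymupdf4llm is implemented. With PyMuPDF, the inputs could presumably come from `page.get_text("blocks")` (tuples starting with x0, y0, x1, y1, text) and `[t.bbox for t in page.find_tables().tables]`.

```python
def inside(inner, outer, tol=1.0):
    """True if rectangle `inner` lies within `outer`; both are (x0, y0, x1, y1)."""
    return (
        inner[0] >= outer[0] - tol
        and inner[1] >= outer[1] - tol
        and inner[2] <= outer[2] + tol
        and inner[3] <= outer[3] + tol
    )


def blocks_outside_tables(blocks, table_bboxes):
    """Keep only the text blocks that are not contained in any table rectangle."""
    return [
        block
        for block in blocks
        if not any(inside(block[:4], bbox) for bbox in table_bboxes)
    ]


# Hypothetical page content: one table rectangle and three text blocks,
# each block given as (x0, y0, x1, y1, text).
table_bboxes = [(50, 200, 550, 320)]
blocks = [
    (50, 100, 550, 120, "This is a first paragraph easy to be extracted"),
    (60, 210, 540, 230, "Start 1st 2nd ... nth Some Last"),
    (60, 240, 540, 300, "test Next Is Badly"),
]

kept = blocks_outside_tables(blocks, table_bboxes)
print([block[4] for block in kept])
```

Only the paragraph above the table survives the filter, so the table text would not be emitted a second time as free text.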