Issue when extracting markdown from a PDF containing a table on a background #114

Jimmy-Newtron · 2024-08-26T11:47:12Z

Jimmy-Newtron
Aug 26, 2024

I have been debugging for a while and am having difficulty understanding one point.

My PDF was created by exporting a PowerPoint (pptx) presentation in landscape layout.

Several slides contain tables, and these are correctly converted to markdown tables.
Unfortunately, the table sits on top of a background image / vector graphic (vg), so the output_images method also extracts its text, in a reading order that loses all meaning.

Why does output_images not exclude the table areas?

JorjMcKie · 2024-08-26T11:51:13Z

JorjMcKie
Aug 26, 2024
Maintainer

Please provide the example file / page. The verbal description alone is not enough.

Jimmy-Newtron · 2024-08-26T12:08:25Z

Jimmy-Newtron
Aug 26, 2024
Author

from-ppt.pdf

Gives the following output:

|Start|1st|2nd|...|nth|Some multiline|Last|
|---|---|---|---|---|---|---|
|A|B|C|D|E|F|G|
|test||Next line|||Is Badly Aligned With Other Text||
||example|||J||Z|


This is a first paragraph easy to be extracted

Start 1st 2nd ... nth Some Last
multiline

A B C D E F G

test Next Is Badly
line Aligned

With

example J Z

Other
Text


-----

1 reply

Jimmy-Newtron Aug 26, 2024
Author

I would expect only the title to appear in the text part, with no duplicate extraction of the table text, since duplicates add noise for RAG.

JorjMcKie · 2024-08-26T13:32:33Z

JorjMcKie
Aug 26, 2024
Maintainer

This example does everything it can to confuse the algorithm:

- there is a full-page image
- vector graphics are drawn on top of that image background
- text and a table are written on top of the vector graphics

The algorithm looks at all content on a page (text, vector graphics and images) and tries to make sense out of the text positions in relation to all content.

Currently there are two choices:

1. Either output images and vector graphics, or refer to them via markdown references. In the latter case, any text located on image-covered areas appears as part of these images.
2. Suppress text that lies on top of images / vector graphics. This is only allowed if images / vector graphics are extracted as images (the previous option).

So either you get what you are complaining about, or you get this output:

|Start|1st|2nd|...|nth|Some multiline|Last|
|---|---|---|---|---|---|---|
|A|B|C|D|E|F|G|
|test||Next line|||Is Badly Aligned With Other Text||
||example|||J||Z|


![](from-ppt.pdf-0-0.png)

-----

when you execute pymupdf4llm.to_markdown(doc, write_images=True, force_text=False).

You can influence the output further if you remove images from a page, and only let text and vector graphics survive:

from pathlib import Path

import pymupdf
import pymupdf4llm

doc = pymupdf.open("from-ppt.pdf")
for page in doc:
    # remove any image on all pages
    page.add_redact_annot(page.rect)
    page.apply_redactions(
        images=pymupdf.PDF_REDACT_IMAGE_REMOVE, # remove images
        graphics=pymupdf.PDF_REDACT_LINE_ART_NONE, # don't touch vector graphics
        text=pymupdf.PDF_REDACT_TEXT_NONE, # don't touch text
    )
# now make markdown extraction from image-free pages
md = pymupdf4llm.to_markdown(doc)
Path(doc.name.replace(".pdf", ".md")).write_bytes(md.encode())

This is the output:

This is a first paragraph easy to be extracted

|Start|1st|2nd|...|nth|Some multiline|Last|
|---|---|---|---|---|---|---|
|A|B|C|D|E|F|G|
|test||Next line|||Is Badly Aligned With Other Text||
||example|||J||Z|


-----

5 replies

Jimmy-Newtron Aug 26, 2024
Author

I agree that the page is very confusing for the algorithm.

I was expecting the table position (bbox) to take priority over vg and image text, since the TableFinder already makes assumptions about the page layout in order to extract tables.

Am I wrong?

Jimmy-Newtron Aug 26, 2024
Author

from-ppt-vg.pdf

What should I do if the background is a VG instead of a picture?

JorjMcKie Aug 26, 2024
Maintainer

Full page vg should be ignored.

JorjMcKie Aug 26, 2024
Maintainer

The sequence of stuff in the output is determined by the bottom coordinates, so the table is written first.
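To make the ordering rule above concrete, here is a minimal plain-Python sketch. It is not the pymupdf4llm code: the `Block` type, the `order_by_bottom` helper and the coordinates are all made up for illustration. It assumes page coordinates where y grows downward, so a smaller bottom coordinate means the block ends higher on the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical page item with a bounding box (y grows downward)."""
    kind: str   # e.g. "table" or "graphic"
    x0: float
    y0: float
    x1: float
    y1: float   # bottom edge


def order_by_bottom(blocks):
    """Return the blocks sorted by bottom edge, smallest bottom first."""
    return sorted(blocks, key=lambda b: b.y1)


# Invented layout: a vector graphic that encloses a table.
graphic = Block("graphic", 40, 100, 520, 330)
table = Block("table", 50, 120, 500, 300)

ordered = order_by_bottom([graphic, table])
print([b.kind for b in ordered])
```

Because the table ends above the bottom of the graphic that encloses it, it sorts first, which would explain why the table precedes the graphic's text in the output.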

Jimmy-Newtron Aug 27, 2024
Author

> Full page vg should be ignored.

In my example PDF, from-ppt-vg.pdf, the vg is actually smaller than the full page but larger than the table.

I am a little surprised that, when a vg encloses a table, the table's text is not excluded from the vg text in the output_images method:

RAG/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py

Line 599 in 78952f1

tab_rects={}, # we have no tables here
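The behaviour I was expecting can be sketched in plain Python. This is only an illustration of the idea, not how pymupdf4llm works: `Rect`, `words_in_graphic` and all coordinates are hypothetical, and the table rectangles are passed as a simple list rather than in whatever structure `tab_rects` uses internally.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical axis-aligned rectangle (y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Rect") -> bool:
        """True if `other` lies completely inside this rectangle."""
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


def words_in_graphic(words, graphic, table_rects):
    """Collect words inside `graphic`, skipping words inside any table."""
    kept = []
    for text, box in words:
        if not graphic.contains(box):
            continue  # word is not on the graphic at all
        if any(t.contains(box) for t in table_rects):
            continue  # word belongs to a table, already emitted there
        kept.append(text)
    return kept


# Invented layout: a vg that encloses a table plus one caption word.
graphic = Rect(40, 100, 520, 330)
table = Rect(50, 120, 500, 300)
words = [
    ("Title", Rect(50, 40, 200, 80)),      # above the graphic
    ("Start", Rect(55, 125, 90, 140)),     # inside the table
    ("Caption", Rect(60, 305, 120, 320)),  # on the graphic, below the table
]

without_tables = words_in_graphic(words, graphic, [])
with_tables = words_in_graphic(words, graphic, [table])
print(without_tables, with_tables)
```

With an empty list of table rectangles the table word is repeated as graphic text; passing the table rectangle leaves only the caption, which is the deduplication I would expect.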
