Not able to parse information in correct sequence #2935

reema93jain · 2023-12-21T18:05:27Z

reema93jain
Dec 21, 2023

Description of the bug

Hi Team,

I am using PyMuPDF to parse data from a PDF that contains text, tables, and images.

When I use the code below just for parsing text, I am able to parse the text in the right sequence:

def extract_text_from_pdf(pdf_path):
    import fitz
    doc = fitz.open(pdf_path)
    text = ''

    for page_number in range(doc.page_count):
        page = doc[page_number]
        text += page.get_text()
    doc.close()
    return text

However, when I alter the code as below, the table content is listed twice (once by the get_text function and once by the .find_tables() function). Also, I am not getting text and tables in the correct sequence. Is there any way I can parse the table data just once?

import fitz # PyMuPDF
import matplotlib.pyplot as plt
import pandas as pd

def parse_pdf(pdf_path):
    doc = fitz.open(pdf_path)

    # Initialize variables to store extracted data
    parsed_data = []

    for page_num in range(doc.page_count):
        page = doc[page_num]

        # Extract text
        text = page.get_text()
        if text:
            parsed_data.append({'type': 'text', 'content': text})

        # Find tables
        tabs = page.find_tables()
        #print(tabs)
        if tabs:
            for tab in tabs:
                table = []
                for line in tab.extract():
                    table.append(line)
                parsed_data.append({'type': 'table', 'content': table})

    doc.close()
    return parsed_data

Calling the function:

pdf_path = "EOS-User-Manual.pdf"
parse_data=parse_pdf(pdf_path)
#calling sub-set
parsed_data=parse_data[0:100000]

Access the parsed data & display it

for entry in parsed_data:
    if entry['type'] == 'text':
        print(entry['content'])
    elif entry['type'] == 'table':
        data = entry['content']
        df = pd.DataFrame(data)
        print(df)

    print()

Can you please advise how I can parse text,table,images in correct sequence using PyMuPDF?

Thank you
Reema Jain

How to reproduce the bug

Complete Code:
!pip install fitz
!pip install PyMuPDF
!pip install PyMuPDF Pillow

import fitz # PyMuPDF
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
import io
from io import BytesIO

def parse_pdf(pdf_path):
    doc = fitz.open(pdf_path)

    # Initialize variables to store extracted data
    parsed_data = []

    for page_num in range(doc.page_count):
        page = doc[page_num]

        # Extract text
        text = page.get_text()
        if text:
            parsed_data.append({'type': 'text', 'content': text})

        # Find tables
        tabs = page.find_tables()
        #print(tabs)
        if tabs:
            for tab in tabs:
                table = []
                for line in tab.extract():
                    table.append(line)
                parsed_data.append({'type': 'table', 'content': table})

    doc.close()
    return parsed_data

Calling the function:

pdf_path = "EOS-User-Manual.pdf"
parse_data=parse_pdf(pdf_path)
#calling sub-set
parsed_data=parse_data[0:100000]

Access the parsed data & display it

for entry in parsed_data:
    if entry['type'] == 'text':
        print(entry['content'])
    elif entry['type'] == 'table':
        data = entry['content']
        df = pd.DataFrame(data)
        print(df)

    print()

PyMuPDF version

1.23.7

Operating system

Windows

Python version

3.10


JorjMcKie · 2023-12-22T09:17:36Z

JorjMcKie
Dec 22, 2023
Maintainer

This is no issue but a Discussions item.
Transferring ...

0 replies

JorjMcKie · 2023-12-23T09:09:03Z

JorjMcKie
Dec 23, 2023
Maintainer

Tables in PDF pages are just text - which is the reason for the existence of so many tools that desperately try to identify page areas that might contain tabular data.

Compared to text extraction, the effort to identify tables is enormous. Consequently, text extraction does not contain a "table avoidance" feature. It is you who must add logic to your text extraction that ignores characters from within areas that previously have been identified as tables.

So you should first identify the tables on the page. This gives you a bbox for each table. You can sort the sequence of those bboxes in the usual way if needed: first by bottom, then by left coordinate.

When extracting text, a frequent issue is that it is not always stored in "natural" reading sequence. Here again, you must add logic to establish that sequence by sorting the output to your liking.
This can be challenging - think of multi-column pages: you would at first have to identify the bboxes of those columns and then sort the extracted text separately for each column.

For the simplest case, I suggest you build a list of table bboxes and then extract the page's text using an output option that delivers text coordinates.

table_bboxes = []  # assume a list of fitz.Rect items wrapping a table each

def in_table(bbox):  # check if bbox inside a table
    rect = fitz.Rect(bbox)
    for tbbox in table_bboxes:
        if rect in tbbox:
            return True
    return False

lines = []  # will contain our lines outside any table

for block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:  # look at text blocks only
    for line in block["lines"]:
        if in_table(line["bbox"]):
            continue  # this line is in a table
        lines.append(line)

lines.sort(key=lambda l: (l["bbox"][3], l["bbox"][0]))  # sort lines in reading sequence

for line in lines:
    text = "".join([s["text"] for s in line["spans"]])  # build the line text
    print(text)
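To also keep the tables in the output sequence, the filtered lines and the tables can be merged into one list and sorted with the same key. The following is only a minimal sketch under stated assumptions: `rect_inside` and `merge_in_reading_order` are hypothetical helper names, bboxes are plain `(x0, y0, x1, y1)` tuples, and the inputs are assumed to have been collected beforehand from `tab.bbox` / `tab.extract()` and from the lines of `page.get_text("dict")`.

```python
def rect_inside(inner, outer):
    """Return True if rectangle `inner` lies completely inside `outer`.

    Both rectangles are (x0, y0, x1, y1) tuples.
    """
    return (
        inner[0] >= outer[0]
        and inner[1] >= outer[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def merge_in_reading_order(text_lines, tables):
    """Merge text lines and tables into one list sorted by position.

    text_lines: list of (bbox, text) pairs
    tables:     list of (bbox, rows) pairs
    Lines that fall inside a table bbox are dropped, so table content
    appears only once (as the table itself).
    """
    items = []
    for bbox, rows in tables:
        items.append({"type": "table", "bbox": tuple(bbox), "content": rows})
    for bbox, text in text_lines:
        if any(rect_inside(bbox, table_bbox) for table_bbox, _ in tables):
            continue  # this line belongs to a table
        items.append({"type": "text", "bbox": tuple(bbox), "content": text})
    # same ordering as above: first by bottom, then by left coordinate
    items.sort(key=lambda item: (item["bbox"][3], item["bbox"][0]))
    return items


# Illustrative data only, not taken from a real PDF
sample_lines = [
    ((50, 300, 200, 312), "After the table"),
    ((50, 50, 200, 62), "Intro"),
    ((60, 120, 100, 132), "cell A"),
]
sample_tables = [((40, 100, 300, 250), [["h1", "h2"], ["cell A", "cell B"]])]

for item in merge_in_reading_order(sample_lines, sample_tables):
    print(item["type"], item["content"])
```

This ordering is only meaningful for single-column pages; multi-column layouts would still need the per-column handling described above.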

1 reply

JorjMcKie Dec 24, 2023
Maintainer

Wait - I should have thought about this in the first place:

A much simpler approach is this:

Find and process the tables on a page.
Erase the text of all tables (using redaction annotations).
Extract the text on the page.

for bbox in table_bboxes:
    page.add_redact_annot(bbox)  # wrap the table in a redaction annotation

page.apply_redactions()  # remove all table text

page.get_text(sort=True)  # extract the remaining text, sorted

That's it!
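Because applying the redactions removes the table text from the page, the table contents have to be read before that step. A minimal sketch of the three steps in that order follows; `extract_tables_then_text` is a hypothetical helper name, and `page` is assumed to behave like a PyMuPDF `Page` (only `find_tables`, `add_redact_annot`, `apply_redactions`, and `get_text` are used).

```python
def extract_tables_then_text(page):
    """Return (tables, text) for one page.

    Step 1 reads the tables, step 2 erases their text via redactions,
    step 3 extracts whatever text is left. The redactions change the
    page object, so use this on a document you do not intend to save.
    """
    # 1. find the tables and keep their content before anything is erased
    finder = page.find_tables()
    tables = []
    for tab in finder.tables:
        tables.append({"bbox": tuple(tab.bbox), "rows": tab.extract()})

    # 2. wrap every table in a redaction annotation and apply them
    for table in tables:
        page.add_redact_annot(table["bbox"])
    if tables:
        page.apply_redactions()

    # 3. the remaining text no longer contains the table cells
    text = page.get_text(sort=True)
    return tables, text
```

The returned `tables` entries keep their bbox, so they can still be placed between the text passages by position afterwards.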

reema93jain · 2024-01-05T21:27:48Z

reema93jain
Jan 5, 2024
Author

Hi JorjMcKie,

Thanks so much for the response. I tried both solutions. Solution 1 works for me; the only issue is that the text is not spaced properly, but I am able to fetch text and table data only once and in sequence.

I also want to fetch image data along with text and tables in sequence. When I tried adding images to the code below, I just got repeated images that are not aligned with the text and table data in the PDF.

Is there anything which I can try for images to be able to parse them in sequence along with text and table data? Below is my code as per your first response:

import fitz # PyMuPDF
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

def in_table(tab_bbox, line_bbox):
    tab_rect = fitz.Rect(tab_bbox)
    line_rect = fitz.Rect(line_bbox)
    return line_rect.intersects(tab_rect)

def parse_pdf(pdf_path):
    # Open pdf
    doc = fitz.open(pdf_path)

    #parsed_data=[]

    # Iterate through all pages in the PDF document
    for page_num in range(doc.page_count):
        page = doc[page_num]

        # Extract text
        lines = []  # will contain our lines outside any table

        # Find Tables
        tabs = page.find_tables()

        if tabs:
            for tab in tabs:
                #table=[]
                for line in tab.extract():
                    lines.append(line)

        for block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:  # look at text blocks only
            for line in block["lines"]:
                if any(in_table(tab.bbox, line["bbox"]) for tab in tabs):
                    continue  # this line is in a table
                lines.append(line)
        #print(lines)
        #lines.sort(key=lambda l: (l["bbox"][3], l["bbox"][0]))  # sort lines in reading sequence

        # Check whether each line is a list or a dict: print the text of dicts, build a DataFrame from lists
        text2 = ""
        dlist = []
        flag = 0
        for line in lines:
            if isinstance(line, list):
                flag = flag + 1
                if flag == 1:
                    columns = line
                    #print(columns)
                    flag = 2
                else:
                    dlist.append(line)
                    #print(dlist)
            if isinstance(line, dict) and "spans" in line:
                if dlist:
                    df = pd.DataFrame(data=dlist, columns=columns)
                    print(df)
                    flag = 0
                    dlist = []
                for span in line['spans']:
                    text = span["text"]
                    #text2="".join(text)
                    text2 = text2 + text + '\n'
        if dlist:
            df = pd.DataFrame(data=dlist, columns=columns)
            print(df)
            flag = 0
            dlist = []

        print(text2)

        # Extract images
        parsed_data = []
        images = page.get_images(full=True)
        if images:
            for img_index, img in enumerate(images):
                img_bytes = doc.extract_image(img[0])
                image = Image.open(BytesIO(img_bytes["image"]))
                parsed_data.append({'type': 'image', 'content': image})

        for entry in parsed_data:
            if entry['type'] == 'image':
                image = entry['content']
                #print(entry['content'])
                plt.imshow(image)
                plt.show()
    doc.close()

Calling the function

pdf_path = "EOS-User-Manual.pdf"
parse_data = parse_pdf(pdf_path)
#calling sub-set
parsed_data=parse_data[0:100]
print(parsed_data)
print()

The second solution is not working for me. Finding tables and applying redactions deletes all of the table data. I even tried calling the get_text and find_tables functions after applying redactions on the tables, but I only see the text data.

Thank you
Reema Jain
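The image question is not answered in this thread. One possible direction, offered only as a sketch: the "dict" output of `page.get_text` can also contain image blocks with their own bbox, provided the `flags=fitz.TEXTFLAGS_TEXT` restriction used above (text blocks only) is left out. The sketch assumes that image blocks carry `"type": 1` and an `"image"` bytes entry while text blocks carry `"type": 0` and `"lines"`; `blocks_in_reading_order` is a hypothetical helper name.

```python
def blocks_in_reading_order(blocks):
    """Turn "dict"-style blocks into a position-sorted list of items.

    blocks: a list shaped like page.get_text("dict")["blocks"].
    Assumption: type 1 marks an image block, anything else is text.
    """
    items = []
    for block in blocks:
        bbox = tuple(block["bbox"])
        if block["type"] == 1:
            # image block: keep the raw image bytes together with its position
            items.append({"type": "image", "bbox": bbox, "content": block.get("image")})
        else:
            # text block: join the spans of each line, one line per row
            text = "\n".join(
                "".join(span["text"] for span in line["spans"])
                for line in block["lines"]
            )
            items.append({"type": "text", "bbox": bbox, "content": text})
    # first by bottom, then by left coordinate
    items.sort(key=lambda item: (item["bbox"][3], item["bbox"][0]))
    return items


# Illustrative blocks only, not taken from a real PDF
sample_blocks = [
    {"type": 1, "bbox": (0, 100, 50, 150), "image": b"raw-bytes"},
    {"type": 0, "bbox": (0, 0, 50, 20),
     "lines": [{"spans": [{"text": "Hel"}, {"text": "lo"}]}]},
]

for item in blocks_in_reading_order(sample_blocks):
    print(item["type"], item["bbox"])
```

Since every image then comes with the bbox of the place where it is shown, it can be sorted together with text and tables instead of being appended per page.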

0 replies
