Not able to parse information in correct sequence #2935
-
Description of the bug
Hi Team, I am using PyMuPDF to parse data from a PDF that contains text, tables and images. When I use the code below just for parsing text, I am able to parse the text in the right sequence:
    def extract_text_from_pdf(pdf_path):
However, when I alter the code as below, I get the table content listed twice (once by the get_text function and once by the .find_tables() function). Also, I am not getting text and tables in the correct sequence. Is there any way I can parse the table data just once?
    import fitz # PyMuPDF
    def parse_pdf(pdf_path):
Calling the function:
    pdf_path = "EOS-User-Manual.pdf"
Access the parsed data & display it
    for entry in parsed_data:
Can you please advise how I can parse text, tables and images in the correct sequence using PyMuPDF? Thank you
How to reproduce the bug
Complete Code:
    import fitz # PyMuPDF
    def parse_pdf(pdf_path):
Calling the function:
    pdf_path = "EOS-User-Manual.pdf"
Access the parsed data & display it
    for entry in parsed_data:
PyMuPDF version: 1.23.7
Operating system: Windows
Python version: 3.10
Replies: 3 comments 1 reply
-
This is not an issue but a Discussions item.
-
Tables in PDF pages are just text, which is the reason for the existence of so many tools that desperately try to identify page areas that might contain tabular data.
Compared to text extraction, the effort to identify tables is enormous. Consequently, text extraction does not contain a "table avoidance" feature. It is you who must add logic to your text extraction that ignores characters from within areas that previously have been identified as tables.
So you should first identify the tables on the page. This gives you a bbox for each table. You can sort the sequence of those bboxes in the usual way if needed: first by bottom, then by left coordinate.
When extracting text, a frequent issue is that it is not always stored in "natural" reading sequence. Here again, you must add logic to establish that sequence by sorting the output to your liking.
For the simplest case, I suggest you build a list of table bboxes and then extract the page's text using an output option that delivers text coordinates.

    import fitz  # PyMuPDF

    # `page` is assumed to be an already loaded fitz.Page
    table_bboxes = []  # assume a list of fitz.Rect items wrapping a table each

    def in_table(bbox):  # check if bbox is inside a table
        rect = fitz.Rect(bbox)
        for tbbox in table_bboxes:
            if rect in tbbox:
                return True
        return False

    lines = []  # will contain our lines outside any table
    for block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:  # look at text blocks only
        for line in block["lines"]:
            if in_table(line["bbox"]):
                continue  # this line is in a table
            lines.append(line)

    lines.sort(key=lambda l: (l["bbox"][3], l["bbox"][0]))  # sort lines in reading sequence
    for line in lines:
        text = "".join([s["text"] for s in line["spans"]])  # build the line text
        print(text)
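To get each table exactly once and in position, the filtered lines and the tables can be merged into a single list and sorted together. The following is only a minimal sketch of that idea on plain bbox tuples, not part of PyMuPDF: the names `inside` and `merge_in_reading_order`, the `(kind, bbox, payload)` record layout and the 1-point tolerance are all assumptions made for illustration. In a real script the table entries might be filled from `page.find_tables().tables` (each table has `.bbox` and `.extract()`), and the text lines from the "dict" output shown above.

```python
# Hypothetical helpers; bboxes are (x0, y0, x1, y1) tuples, y grows downward.

def inside(inner, outer, tol=1.0):
    """Return True if bbox `inner` lies within bbox `outer`, with `tol` points of slack."""
    return (
        inner[0] >= outer[0] - tol
        and inner[1] >= outer[1] - tol
        and inner[2] <= outer[2] + tol
        and inner[3] <= outer[3] + tol
    )


def merge_in_reading_order(text_lines, tables):
    """Drop text lines covered by a table, then interleave lines and tables.

    text_lines: list of (bbox, text); tables: list of (bbox, rows).
    Returns (kind, bbox, payload) records sorted by bottom, then left coordinate.
    """
    table_boxes = [bbox for bbox, _ in tables]
    elements = [("table", bbox, rows) for bbox, rows in tables]
    for bbox, text in text_lines:
        # A line inside any table is already represented by that table.
        if not any(inside(bbox, tb) for tb in table_boxes):
            elements.append(("text", bbox, text))
    elements.sort(key=lambda e: (e[1][3], e[1][0]))
    return elements
```

The tolerance is there because line bboxes can stick out slightly past the detected table border; whether 1 point is the right value depends on the document.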
-
Hi JorjMcKie, thanks so much for the response. I tried both solutions, and solution 1 is working for me. The only issue is that the text data is not spaced properly, but I am able to fetch text and table data only once and in sequence.
I also want to fetch image data along with text and tables in sequence. When I tried adding images to the code below, I just get repeated images, and they are not aligned with the text and table data in the PDF. Is there anything I can try so that images are parsed in sequence along with text and table data?
Below is my code as per your first response:
    import fitz # PyMuPDF
    def in_table(tab_bbox, line_bbox):
    def parse_pdf(pdf_path):
Calling the function
    pdf_path = "EOS-User-Manual.pdf"
The second solution is not working for me. Finding tables and applying redactions deletes all of the table data. I even tried calling the get_text and find_tables functions after applying redactions on the tables, but I only see text data.
Thank you
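One possible way to place images in sequence is to treat each image as another positioned element: collect its bbox, drop entries that repeat the same bbox, and sort everything by bottom and left coordinate. The sketch below works on plain tuples and is an illustration under assumptions, not a confirmed fix for the repeated images described above: `dedupe_images`, `add_images` and the `(kind, bbox, payload)` layout are hypothetical names, and it assumes the image bboxes were gathered per page, for example from the "bbox" entries returned by `page.get_image_info()`.

```python
# Hypothetical helpers; bboxes are (x0, y0, x1, y1) tuples, y grows downward.

def dedupe_images(images, ndigits=1):
    """Keep the first image for each rounded bbox; images is a list of (bbox, payload)."""
    seen = set()
    unique = []
    for bbox, payload in images:
        key = tuple(round(v, ndigits) for v in bbox)
        if key not in seen:
            seen.add(key)
            unique.append((bbox, payload))
    return unique


def add_images(elements, images):
    """Merge deduplicated images into (kind, bbox, payload) records.

    The result is sorted by bottom, then left coordinate.
    """
    merged = list(elements)
    merged += [("image", bbox, payload) for bbox, payload in dedupe_images(images)]
    merged.sort(key=lambda e: (e[1][3], e[1][0]))
    return merged
```

Deduplicating by position only removes entries drawn at the same place; an image that is legitimately shown at several positions on a page stays in the output once per position.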