I have a couple of questions about these parameters for the attachment above.
# Q1: Why is the text being split into multiple columns, especially Arzt and Col4?
# Q2: Does join_x_tolerance have any effect on this?
import pymupdf
doc=pymupdf.open('allg_DP_Zwickau (1).pdf')
page=doc[2]
tabs=page.find_tables(strategy='text',clip=(0,page.search_for('Arzt')[0][1],700,page.rect[3]),snap_y_tolerance=7,join_x_tolerance=10)
tabs.tables[0].to_pandas()
Output:
# Q1: Why is Col5 present when I am explicitly specifying the vertical lines?
# Q2: Will the first column be between 42.52 and 63.92, and the second column between 63.92 and 106.44?
doc=pymupdf.open('allg_DP_Zwickau (1).pdf')
page=doc[3]
vertical_lines= [42.52,63.92,106.44,148.96,367.22,460]
print(page.get_text('words',clip=(0,page.search_for('Arzt')[0][1],700,page.search_for('Arzt')[0][1]+30)))
tabs=page.find_tables(horizontal_strategy='text',clip=(0,page.search_for('Arzt')[0][1],700,page.rect[3]),vertical_lines=vertical_lines ,snap_y_tolerance=7)
tabs.tables[0].to_pandas()
Output:
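To make the second question concrete, here is a minimal sketch of the interval reading I have in mind: neighbouring x-coordinates are paired into columns, and a word is assigned by its horizontal midpoint. The helpers `column_intervals` and `column_of` and the sample word are hypothetical, written only for illustration. This is my assumption about how the lines should partition the page, not a description of how `find_tables` assigns cells internally.

```python
from bisect import bisect_right


def column_intervals(xs):
    """Pair neighbouring x-coordinates into (left, right) column intervals."""
    xs = sorted(xs)
    return list(zip(xs, xs[1:]))


def column_of(word, xs):
    """Return the 0-based column index for a word tuple, or None if outside.

    word follows the page.get_text('words') layout:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    Assumption: a word belongs to the column containing its horizontal midpoint.
    """
    xs = sorted(xs)
    mid = (word[0] + word[2]) / 2
    i = bisect_right(xs, mid) - 1
    if i < 0 or i >= len(xs) - 1:
        return None  # left of the first line or right of the last one
    return i


vertical_lines = [42.52, 63.92, 106.44, 148.96, 367.22, 460]
print(column_intervals(vertical_lines))

# Hypothetical word tuple, not taken from the PDF.
sample = (45.0, 100.0, 60.0, 110.0, "Arzt", 0, 0, 0)
print(column_of(sample, vertical_lines))
```

Under this reading, six lines can bound at most five columns. If more columns show up in the result, some additional vertical edge must be coming from elsewhere; whether the default `vertical_strategy` contributes it is something I have not verified.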
# Q1: Why is Col5 present when I am explicitly specifying the lines?
# Q2: Can we use the add_boxes param in this PDF example?
doc=pymupdf.open('allg_DP_Zwickau (1).pdf')
page=doc[3]
top=page.search_for('Arzt')[0][1]
y_b=page.rect.y1
print(page.get_text('words',clip=(0,page.search_for('Arzt')[0][1],700,page.search_for('Arzt')[0][1]+30)))
lines= [
((42.52,top),(42.52,y_b)),
((63.92,top),(63.92,y_b)),
((106.44,top),(106.44,y_b)),
((148.96,top),(148.96,y_b)),
((367.22,top),(367.22,y_b)),
((460,top),(460,y_b)),
]
tabs=page.find_tables(horizontal_strategy='text',clip=(0,page.search_for('Arzt')[0][1],700,page.rect[3]),add_lines=lines,snap_y_tolerance=7)
tabs.tables[0].to_pandas()
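The `lines` list above repeats the same pattern for every x-coordinate, so it can also be generated. A small sketch, assuming the same `((x, top), (x, bottom))` pair-of-points layout used above; `vertical_segments` is a hypothetical helper, and the `top` and `y_b` values here are placeholders rather than values read from the PDF.

```python
def vertical_segments(xs, top, bottom):
    """Build one vertical segment ((x, top), (x, bottom)) per x-coordinate."""
    return [((x, top), (x, bottom)) for x in xs]


xs = [42.52, 63.92, 106.44, 148.96, 367.22, 460]
top, y_b = 100.0, 842.0  # placeholder values; in the notebook these come from the page
lines = vertical_segments(xs, top, y_b)
print(lines[0])
```

The resulting list has the same shape as the hand-written one, so it can be passed as `add_lines=lines` in the call above.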
allg_DP_Zwickau (1).pdf
I will attach the Jupyter notebook and PDF file as a zip file here.
Invoice Data Extraction.zip