Tables in PDF pages are just text - which is the reason for the existence of so many tools that desperately try to identify page areas that might contain tabular data.

Compared to text extraction, the effort to identify tables is enormous. Consequently, text extraction has no built-in "table avoidance" feature. You must add that logic to your own text extraction: ignore characters that lie inside areas previously identified as tables.
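A minimal sketch of such filtering logic, assuming the table areas are already known as `(x0, y0, x1, y1)` rectangles and that each extracted word is a tuple starting with its own rectangle followed by the text. That word layout matches what PyMuPDF's `page.get_text("words")` returns, but the library choice is an assumption here, and the helper names `inside` and `words_outside_tables` are hypothetical.

```python
def inside(word_rect, table_rect):
    """Return True if the center of word_rect lies within table_rect.

    Both rectangles are (x0, y0, x1, y1) tuples. Testing the center
    (an illustrative choice) tolerates words that slightly overlap
    a table border.
    """
    x0, y0, x1, y1 = word_rect
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    tx0, ty0, tx1, ty1 = table_rect
    return tx0 <= cx <= tx1 and ty0 <= cy <= ty1


def words_outside_tables(words, table_rects):
    """Keep only the words that do not fall inside any table rectangle."""
    return [
        w for w in words
        if not any(inside(w[:4], t) for t in table_rects)
    ]


# Hypothetical sample data: one table area and three words.
tables = [(50, 100, 200, 200)]
words = [
    (10, 10, 30, 20, "Intro"),
    (60, 110, 90, 120, "cell"),
    (10, 300, 40, 310, "Outro"),
]
kept = words_outside_tables(words, tables)
```

Whether to test the word's center, its full rectangle, or any overlap at all is a design choice; the center test is only one reasonable option.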

So you should first identify the tables on the page. This gives you a bbox for each table. You can sort the sequence of those bboxes in the usual way if needed: first by bottom, then by left coordinate.
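The sort order described above could look like the following sketch. It assumes each bbox is an `(x0, y0, x1, y1)` tuple in a coordinate system where y grows downward, so the bottom edge is `y1`; the function name `sort_bboxes` is hypothetical.

```python
def sort_bboxes(bboxes):
    """Sort bboxes by bottom coordinate first, then by left coordinate.

    Assumes (x0, y0, x1, y1) tuples with y increasing downward,
    so index 3 is the bottom edge and index 0 the left edge.
    """
    return sorted(bboxes, key=lambda b: (b[3], b[0]))


# Hypothetical table bboxes in arbitrary order.
table_bboxes = [
    (300, 50, 500, 150),   # upper right
    (50, 400, 250, 500),   # lower left
    (50, 50, 250, 150),    # upper left
]
ordered = sort_bboxes(table_bboxes)
```

With this ordering, tables whose bottom edges are level are read left to right, and tables further down the page come later.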

When extracting text, a frequent is…

@JorjMcKie
Answer selected by reema93jain
This discussion was converted from issue #2930 on December 22, 2023 09:17.