Tokens detected by PyMuPDF are not visible on the PDF page #3030

rudra0713 · 2024-01-13T06:59:17Z

rudra0713
Jan 13, 2024

Hi, I have been experimenting with PyMuPDF recently and started by drawing boxes around the detected tokens. On one page of my PDF document, some of the returned token boxes look completely empty in the PDF, yet PyMuPDF reports tokens for those boxes.

I have drawn a subset of the returned token boundaries on the PDF page and attached the image here. Between the tokens "2020" and "2019", there are a number of green boxes that, according to PyMuPDF, contain tokens (for example, "x", "y", "z"). Later, when I opened the PDF in a PDF editor and searched for those tokens ("x", "y", "z"), the editor found them inside those boxes.

From this experiment, I think PyMuPDF is doing a good job of determining the tokens, but for my downstream task, I would like to discard tokens that are not visible on the PDF page. Is there a way to achieve this?

I am also attaching the original PDF; the image shown here is page 20 of that PDF.
Teck-2022-Sustainability-Report.pdf

JorjMcKie · 2024-01-13T07:08:24Z

JorjMcKie
Jan 13, 2024
Maintainer

This is no issue but a typical Discussions item. Converting.

0 replies

JorjMcKie · 2024-01-13T07:16:27Z

JorjMcKie
Jan 13, 2024
Maintainer

You did not provide the file in question. An image is no help for clarifying what is going on.
There is more than one possible reason why this happens: the text may be white on a white background, it may sit behind the page's background color, it may be written with an invisible rendering mode, etc.
It is impossible to tell from the image alone.
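
To illustrate only the first of these cases: below is a minimal sketch, assuming the hidden text is simply drawn in the background color (white here). It works on the dictionary returned by `page.get_text("dict")`, where each span carries its text color as an sRGB integer. The helper name `visible_spans` and the `background` parameter are made up for this example, and it does nothing for text hidden in other ways (covered by other objects, invisible rendering mode, and so on).

```python
def visible_spans(page_dict, background=0xFFFFFF):
    """Yield (text, bbox) for text spans whose color differs from `background`.

    `page_dict` is expected to have the layout of page.get_text("dict").
    `background` is an assumed page background color as an sRGB integer.
    """
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                # Skip whitespace-only spans and spans drawn in the background color.
                if span["text"].strip() and span["color"] != background:
                    yield span["text"], tuple(span["bbox"])


# Possible usage (file name and page index are placeholders):
#   import fitz  # PyMuPDF
#   page = fitz.open("input.pdf")[0]
#   for text, bbox in visible_spans(page.get_text("dict")):
#       print(text, bbox)
```

This compares against a single fixed background color, so it would miss white text on a colored area or colored text matching its local background.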

0 replies

rudra0713 · 2024-01-19T12:13:38Z

rudra0713
Jan 19, 2024
Author

Hi @JorjMcKie, sorry for the late response. I had to resolve some proprietary issues regarding the pdf first. I have uploaded the pdf in my original post.

2 replies

JorjMcKie Jan 19, 2024
Maintainer

Sorry: still not downloadable, just points to the URL of this issue 😒.

rudra0713 Jan 19, 2024
Author

Sorry again, can you kindly recheck now?

JorjMcKie · 2024-01-19T21:53:48Z

JorjMcKie
Jan 19, 2024
Maintainer

Well, these boxes

are not empty: copy and paste delivers:

Obviously, it was the PDF creator's intention to have it exactly like that.

1 reply

rudra0713 Jan 19, 2024
Author

Yes, I agree with you, but since the tokens themselves are not visible on the surface of the PDF, is there a way to achieve the same result when extracting the token boundaries?

JorjMcKie · 2024-01-20T14:00:51Z

JorjMcKie
Jan 20, 2024
Maintainer

After investigating several alternatives for hiding objects, I must admit: I see no way to ignore these items on this page.
Sorry about that.
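
For anyone with a similar downstream need, one generic workaround (a rendering heuristic, not a feature of text extraction, and not verified against this file) is to render each word's bounding box and drop words whose box comes out as a single flat color. The sketch below assumes `page` is a PyMuPDF `Page`; the helper names `is_blank` and `visible_words` and the `dpi` default are invented for this example.

```python
def is_blank(samples, n):
    """Return True if every pixel in the raw sample buffer is identical.

    `samples` is the pixel byte buffer, `n` the number of bytes per pixel.
    """
    first = samples[:n]
    return all(samples[i:i + n] == first for i in range(0, len(samples), n))


def visible_words(page, dpi=150):
    """Return (word, bbox) pairs whose rendered box shows more than one color."""
    kept = []
    for x0, y0, x1, y1, word, *_ in page.get_text("words"):
        if x1 <= x0 or y1 <= y0:  # skip degenerate boxes that cannot be rendered
            continue
        # Render only the area covered by this word.
        pix = page.get_pixmap(clip=(x0, y0, x1, y1), dpi=dpi)
        if not is_blank(pix.samples, pix.n):
            kept.append((word, (x0, y0, x1, y1)))
    return kept


# Possible usage (file name and page index are placeholders):
#   import fitz  # PyMuPDF
#   page = fitz.open("input.pdf")[0]
#   print(visible_words(page))
```

This renders one small pixmap per word, so it is slow on dense pages, and it can misjudge words whose box overlaps a line, image, or gradient.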

1 reply

rudra0713 Jan 20, 2024
Author

Hi @JorjMcKie, thanks a lot for letting me know.
