Replies: 5 comments 4 replies
-
This is no issue but a typical Discussions item. Converting.
-
You did not provide the file in question. An image is no help for clarifying what is going on.
-
Hi @JorjMcKie, sorry for the late response. I had to resolve some proprietary issues regarding the PDF first. I have uploaded the PDF in my original post.
-
Well, these boxes Obviously, it was the PDF creator's intention to have it exactly like that.
-
After investigating several alternatives for hiding objects, I must admit: I see no way to ignore these items on this page. |
-
Hi, I have been experimenting with PyMuPDF recently and I started by drawing boxes around the detected tokens. For one of the pages in my PDF document, I noticed that some of the returned token boxes look completely empty on the page, yet PyMuPDF reports tokens for those boxes.
I have drawn a subset of the returned token boundaries on the PDF page and attached the image here. Between the tokens "2020" and "2019", there are a bunch of green boxes that, according to PyMuPDF, contain tokens (for example, "x", "y", "z"). Later, when I opened the PDF in a PDF editor and searched for those tokens ("x", "y", "z"), the PDF editor found them inside those boxes.
From this experiment, I think PyMuPDF is doing a good job of determining the tokens, but for my downstream task, I would like to discard tokens that are not visible on the PDF page. Is there a way to achieve this?

I am also attaching the original PDF. The image shown here is page 20 of that PDF.
Teck-2022-Sustainability-Report.pdf