Extract text from "flattened" PDF? #310
-
Hi, I'm attempting to extract text from a PDF, but it's not being extracted. I've tried with multiple other PDFs and the text extracts fine, so there's something particular about this PDF in question. Could anyone suggest some ways I can debug why the text is not being extracted? For reference the PDF is a single page, version 1.4, Author is Canva, and Security is None. File itself is 1.8MB. Any help would be appreciated. Thanks
Replies: 2 comments 2 replies
-
Did a bit more testing with Canva - when a document is "flattened" during the export process, it becomes unreadable by pypdfium2. Please see the two attached files.
-
AFAICS, the canva_flattened PDF just doesn't have a text layer, so there is nothing for a PDF engine to extract. You could use an OCR tool like ocrmypdf to change that, though.

Also note that canva_full is a vector PDF, whereas canva_flattened merely contains a rasterized image of the former.

For this reason, I recommend just not using this "flattening", as it evidently rasterizes, which is inelegant (maybe you've noticed the file size of canva_flattened is way higher than that of canva_full).
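
For debugging similar cases, here is a rough sketch of how one might check whether a page has a text layer at all or only carries images. It assumes a recent pypdfium2 with the helper API (`PdfDocument`, `get_textpage()`, `count_chars()`, `get_objects()`). The names `classify_page` and `inspect_pdf` are made up for illustration, and the classification is only a heuristic, not an official diagnostic.

```python
def classify_page(char_count: int, image_count: int) -> str:
    """Heuristic guess at what a page contains (illustrative, not exhaustive)."""
    if char_count > 0:
        return "has text layer"
    if image_count > 0:
        return "image only (needs OCR)"
    return "no text or images found"


def inspect_pdf(path: str) -> list:
    """Return one classification string per page of the PDF at `path`."""
    # Imported lazily so classify_page() works without pypdfium2 installed.
    import pypdfium2 as pdfium
    import pypdfium2.raw as pdfium_c

    pdf = pdfium.PdfDocument(path)
    results = []
    for index in range(len(pdf)):
        page = pdf[index]
        # Number of characters the engine can see in the page's text layer.
        n_chars = page.get_textpage().count_chars()
        # Count image objects among the page objects (default nesting depth).
        n_images = sum(
            1 for obj in page.get_objects()
            if obj.type == pdfium_c.FPDF_PAGEOBJ_IMAGE
        )
        results.append(classify_page(n_chars, n_images))
    return results
```

Assuming the attachments were saved locally as `canva_full.pdf` and `canva_flattened.pdf` (hypothetical file names), calling `inspect_pdf()` on each should make the difference visible: a page reporting zero characters but at least one image is a candidate for OCR rather than plain text extraction.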