Extract text from "flattened" PDF? #310

jasonharitou · 2024-04-18T15:16:26Z

Hi,

I'm attempting to extract text from a PDF, but it's not being extracted. I've tried with multiple other PDFs and the text extracts fine, so there's something particular about this PDF in question.

Could anyone suggest some ways I can debug why the text is not being extracted? For reference, the PDF is a single page, version 1.4; the Author is Canva and Security is None.

File itself is 1.8MB.

Any help would be appreciated. Thanks

Answered by mara004


jasonharitou · 2024-04-18T16:11:40Z

jasonharitou
Apr 18, 2024
Author

Did a bit more testing with Canva - when a document is "flattened" during the export process, it becomes unreadable by pypdfium2. Please see the two attached files.
canva-flattened.pdf
canva-full.pdf

mara004 · 2024-04-18T16:48:03Z

mara004
Apr 18, 2024
Maintainer

AFAICS, the canva_flattened PDF just doesn't have a text layer, so there is nothing for a PDF engine to extract. You could use an OCR tool like ocrmypdf to add one, though.
Also note that canva_full is a vector PDF, whereas canva_flattened merely contains a rasterized image of the former.
For this reason, I recommend not using this "flattening" at all: it evidently rasterizes the page, which is inelegant (maybe you've noticed that the file size of canva_flattened is much higher than that of canva_full).
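
For anyone who wants to confirm a missing text layer themselves, a minimal sketch along these lines may help. It assumes pypdfium2 v4 (`PdfDocument`, `get_textpage().count_chars()`, `get_objects()` and the `FPDF_PAGEOBJ_*` constants from `pypdfium2.raw`). The helper names `diagnose` and `inspect_page` are hypothetical and not part of the library, and the classification is only a heuristic.

```python
def diagnose(n_chars, n_text_objs, n_image_objs):
    """Heuristic guess at why a page yields no text (hypothetical helper)."""
    if n_chars > 0 or n_text_objs > 0:
        return "has text layer"
    if n_image_objs > 0:
        return "image only: needs OCR"
    return "no text or image objects (text may be drawn as vector paths)"


def inspect_page(path, index=0):
    """Count characters and page objects on one page, then classify it."""
    # Imported here so diagnose() stays usable without pypdfium2 installed.
    import pypdfium2 as pdfium
    import pypdfium2.raw as pdfium_c

    pdf = pdfium.PdfDocument(path)
    page = pdf[index]

    # Number of characters the text extractor can see on this page.
    n_chars = page.get_textpage().count_chars()

    # Object types present on the page (text, image, path, ...).
    types = [obj.type for obj in page.get_objects()]
    n_text = types.count(pdfium_c.FPDF_PAGEOBJ_TEXT)
    n_image = types.count(pdfium_c.FPDF_PAGEOBJ_IMAGE)

    return diagnose(n_chars, n_text, n_image)
```

Calling `inspect_page("canva-flattened.pdf")` on a rasterized export would be expected to report the image-only case, while a vector export should report a text layer.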

2 replies

jasonharitou Apr 18, 2024
Author

Yeah I noticed the full version was massive! Thanks for your answer, OCR tools here I come…

mara004 Apr 18, 2024
Maintainer

(note for readers: canva's flattening should not be confused with PDF form field flattening as provided by pdfium)
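
As a follow-up to the OCR suggestion in this thread, here is a rough sketch of adding a text layer by calling the ocrmypdf command-line tool from Python. It assumes ocrmypdf and its OCR backend are installed and on `PATH`; the helper names `build_ocr_command` and `add_text_layer` are hypothetical, and `eng` is only an example language code.

```python
import shutil
import subprocess


def build_ocr_command(src, dst, language="eng"):
    """Return the argv list for an ocrmypdf run (hypothetical helper)."""
    # -l selects the OCR language; src is the input PDF, dst the output PDF.
    return ["ocrmypdf", "-l", language, str(src), str(dst)]


def add_text_layer(src, dst, language="eng"):
    """Run ocrmypdf so that dst contains a searchable text layer."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_ocr_command(src, dst, language), check=True)
```

After something like `add_text_layer("canva-flattened.pdf", "canva-flattened-ocr.pdf")`, text extraction on the output file should return text, with accuracy depending on the OCR quality.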
