Replies: 2 comments 3 replies
-
|
If the bad text is actual visible PDF text objects, I would not expect a qpdf overlay of another invisible text layer to reliably change what gets selected/copied. PDF viewers generally see multiple text objects in the page content, and selection order is not a simple "new layer wins" rule. OCRmyPDF has two relevant modes, but they solve different cases:
So I would try:
ocrmypdf --redo-ocr input.pdf output.pdf
only if the bad copied text is really an old hidden OCR layer. If the PDF came from Word and the bad text is the visible embedded text itself, use:
ocrmypdf --force-ocr -O 2 input.pdf output.pdf
To control size, experiment with optimization level, output type, and image quality options, or regenerate the PDF from the original source if possible. But "keep all original visible text/vector content unchanged while making a new OCR layer take precedence for copy/highlight" is not something I would rely on as a supported OCRmyPDF/qpdf workflow.
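If you want to script that choice, here is a minimal Python sketch of the decision between the two modes. The function name `build_ocrmypdf_command` and the `text_is_hidden_ocr_layer` flag are hypothetical names of my own; the only ocrmypdf options it uses are the ones already mentioned in this reply. It only builds the argument list, so you still have to decide for yourself which case your file falls into.

```python
import shlex


def build_ocrmypdf_command(src, dst, text_is_hidden_ocr_layer):
    """Build the ocrmypdf argument list for the two cases discussed above.

    `text_is_hidden_ocr_layer` is a hypothetical flag: set it to True only
    if the garbage you copy comes from an old invisible OCR layer.
    """
    if text_is_hidden_ocr_layer:
        # Bad text is an old hidden OCR layer: redo only the OCR.
        mode = ["--redo-ocr"]
    else:
        # Bad text is the visible embedded text itself: force OCR of
        # every page and ask for optimization level 2 to limit size.
        mode = ["--force-ocr", "-O", "2"]
    return ["ocrmypdf", *mode, src, dst]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    cmd = build_ocrmypdf_command("input.pdf", "output.pdf", False)
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once you are happy with it. Keeping it as a list rather than a single string avoids quoting problems with file names that contain spaces.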
-
|
This is why I was hoping for some alternatives.
-
I've got a file (from someone else) which has "actual" text in it, like a Word file turned into a PDF. The problem is that upon copying the text, it's garbage. For example, if I use pdftotext on it, I get output like "-n99QJ:4Y=Z� ^n?hjZ� Q4pZZ� h?YjZ� QZ99QJGZ� =J� hZ==JhF4?� :ZW?� ]ZhhZ� J�WJ?J� =Zp?`J".
I can of course use --force-ocr to fix this, but that blows up the file size immensely. I was hoping I could just grab the text out of that process and use qpdf to overlay it on the original, but that doesn't work. The original text is still what gets copied or highlighted when I try this on the resultant file.
Is there a way to do this? That is, force the OCRed text to sit on top of the "actual" text, so it gets copied/highlighted?