Put invisible text over actual text? #1670

Jmuccigr · 2026-04-21T15:36:06Z

I've got a file (from someone else) which has "actual" text in it, like a Word file turned into a PDF. The problem is that upon copying the text, it's garbage. For example, if I use pdftotext on it, I get output like "-n99QJ:4Y=Z� ^n?hjZ� Q4pZZ� h?YjZ� QZ99QJGZ� =J� hZ==JhF4?� :ZW?� ]ZhhZ� J�WJ?J� =Zp?`J".

I can of course use --force-ocr to fix this, but that blows up the file size immensely. I was hoping I could just grab the text out of that process and use qpdf to overlay it on the original, but that doesn't work. The original text is still what gets copied or highlighted when I try this on the resultant file.

Is there a way to do this? That is, force the OCRed text to sit on top of the "actual" text, so it gets copied/highlighted?

cookesan · 2026-04-21T16:58:55Z

If the bad text is actual visible PDF text objects, I would not expect a qpdf overlay of another invisible text layer to reliably change what gets selected/copied. PDF viewers generally see multiple text objects in the page content, and selection order is not a simple "new layer wins" rule.

OCRmyPDF has two relevant modes, but they solve different cases:

- --redo-ocr is for replacing an existing OCR layer where OCRmyPDF can identify it as OCR text.
- --force-ocr rasterizes the page and creates a new OCR layer, which is why it fixes copy/paste in cases like this, but also why file size can grow and vector/text content can be lost.

So I would try:

ocrmypdf --redo-ocr input.pdf output.pdf

only if the bad copied text is really an old hidden OCR layer. If the PDF came from Word and the bad text is the visible embedded text itself, --redo-ocr may preserve that visible text rather than replacing it. In that case --force-ocr is the reliable OCRmyPDF route:

ocrmypdf --force-ocr -O 2 input.pdf output.pdf

To control size, experiment with optimization level, output type, and image quality options, or regenerate the PDF from the original source if possible. But "keep all original visible text/vector content unchanged while making a new OCR layer take precedence for copy/highlight" is not something I would rely on as a supported OCRmyPDF/qpdf workflow.
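As a rough sketch of that size tuning (not an official recipe), the wrapper below runs --force-ocr with plain PDF output, optimization level 2, and a lowered JPEG quality. The function name `force_ocr_small` and the default quality of 60 are assumptions made up for illustration, and `-O 2` may need pngquant installed.

```shell
# Hypothetical wrapper: force OCR while trying to keep the output small.
# Usage: force_ocr_small INPUT OUTPUT [JPEG_QUALITY]
force_ocr_small() {
  # ${3:-60}: JPEG quality; 60 is an assumed starting point, not a recommendation.
  ocrmypdf --force-ocr --output-type pdf -O 2 \
    --jpeg-quality "${3:-60}" "$1" "$2"
}
```

For example, `force_ocr_small input.pdf output.pdf 50`, then compare the file sizes and check the page appearance before settling on a value.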

3 replies

Jmuccigr Apr 21, 2026
Author

Thanks for the reply.

I know about those options, which is why I mentioned --force-ocr. I should have added that I'd tried --redo-ocr without success. :-(

And, again, the PDF isn't from me, so I can't regenerate it from source. The text layer really is text: if I use, say, Ghostscript to remove text from the file, the visible text disappears.
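For anyone wanting to repeat that check, a minimal sketch, assuming a Ghostscript version whose pdfwrite device supports `-dFILTERTEXT`; the function name `strip_text` is made up for illustration. If the visible words vanish from the output, they are real text objects rather than an image with a hidden OCR layer.

```shell
# Hypothetical helper: write a copy of a PDF with all text objects dropped.
# Usage: strip_text INPUT OUTPUT
strip_text() {
  gs -q -o "$2" -sDEVICE=pdfwrite -dFILTERTEXT "$1"
}
```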

cookesan Apr 21, 2026

Given that clarification, I think this is the unsupported case: the bad content is visible/printable text, not an OCR layer OCRmyPDF can identify and replace.

That matches the current docs. --redo-ocr strips invisible OCR text and masks visible text before OCR; existing visible text objects are not changed. The docs also call out the hard case where OCR-like text is technically printable/visible, because OCRmyPDF cannot distinguish that from real text and will not redo it.

So if Ghostscript text filtering makes the visible text disappear, that explains why --redo-ocr cannot surgically fix it. The practical paths are basically:

- --force-ocr / --mode force, with optimization and image-quality tuning, if you need an OCRmyPDF-supported result;
- a custom destructive pipeline that removes all text and then grafts/overlays new OCR text back, accepting that this is outside OCRmyPDF's safe batch behavior;
- a feature request around replacing visible PDF text while preserving non-text page content, but I would not treat that as existing OCRmyPDF behavior.

This is also why the qpdf overlay approach does not quite solve it: it adds more text objects, but it does not remove or reliably deprioritize the original visible text objects for selection/copy.
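One way the custom destructive pipeline could look, as an untested sketch rather than a supported workflow: instead of deleting the text outright, it uses Ghostscript's `-dNoOutputFonts` pdfwrite option to convert glyphs to vector outlines, so the page should keep its appearance but no longer contain text objects, and then runs plain OCRmyPDF on the result. Swapping text removal for outlining is an assumption about what would suit this kind of file; the function name `outline_then_ocr` and the intermediate file name are made up for illustration, and outlined glyphs can themselves increase file size.

```shell
# Hypothetical pipeline: outline the broken text, then OCR the outlined file.
# Usage: outline_then_ocr INPUT OUTPUT
outline_then_ocr() {
  tmp="${2%.pdf}.outlined.pdf"   # intermediate file next to the output
  # Step 1: convert text to vector outlines, so no text objects remain.
  gs -q -o "$tmp" -sDEVICE=pdfwrite -dNoOutputFonts "$1" || return 1
  # Step 2: ordinary OCR; with no text left, --force-ocr should not be needed.
  ocrmypdf "$tmp" "$2" || return 1
  # Step 3: clean up the intermediate file.
  rm -f "$tmp"
}
```

It would be worth trying this on a single page first and comparing both the appearance and the copied text against the original.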

cookesan Apr 22, 2026

Thanks for clarifying that the visible content is real printable text. That means --redo-ocr is expected not to replace it: OCRmyPDF's docs say redo mode removes invisible OCR text, masks out visible text, and OCRs the remaining page image; it cannot distinguish OCR-like text that is technically visible/printable from real text. Docs: https://ocrmypdf.readthedocs.io/en/stable/advanced.html#when-ocr-is-skipped

For that class of PDF, --force-ocr is the OCRmyPDF mode that can rebuild the text layer, because it rasterizes the page, discards existing text objects, and OCRs the rendered page. The tradeoff is unavoidable in OCRmyPDF: it cannot both preserve the original visible vector/text objects and add a duplicate invisible OCR layer over them.

So if you need to preserve the original PDF's vector text exactly, OCRmyPDF probably is not the right tool for that file. If you can accept rasterizing the page appearance, extract a single page first and test --force-ocr on it, so you can check size, appearance, and searchability before processing the whole document.
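A minimal sketch of that one-page trial, assuming qpdf and pdftotext are installed alongside OCRmyPDF; the function name `try_first_page` and the output file names are made up for illustration.

```shell
# Hypothetical helper: run --force-ocr on page 1 only and show the results.
# Usage: try_first_page INPUT
try_first_page() {
  # Extract page 1 of the input into its own file.
  qpdf "$1" --pages . 1 -- page1.pdf || return 1
  # Rebuild the text layer on that single page.
  ocrmypdf --force-ocr page1.pdf page1-ocr.pdf || return 1
  # Compare file sizes, then print the text that would now be copied.
  ls -l page1.pdf page1-ocr.pdf
  pdftotext page1-ocr.pdf -
}
```

Opening page1-ocr.pdf in a viewer next to page1.pdf covers the appearance check that the commands cannot.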

Jmuccigr · 2026-04-22T08:35:18Z

This is why I was hoping for some alternatives.
