QUESTION: how can I access the text matrix? #3071

libTorrentUser · 2024-01-20T16:21:39Z

Is it possible to access the text matrix?

I have a PDF file where I'm only interested in the parts of the text that are in Italic. But this particular PDF file uses matrix transformations to render the Italic text. That means I cannot rely on the flags to tell whether the text is in Italic or not.

When I use mupdf's mutool trace, I can see that the <span> tag generated for the lines that are rendered in Italic has a specific value for the trm attribute. Unfortunately, that attribute is not included in the pymupdf span dictionary. You guys seem to use it, but it never makes it into the dictionary.

Is there any other property that can give me the information I need?
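
For reference, the kind of check I have in mind, given the four numbers of a trm value, could be sketched like this. It is only a rough sketch: the function names and the 5 degree threshold are my own assumptions, not anything from pymupdf or mupdf, and a matrix that combines rotation with non-uniform scaling could also trip it.

```python
import math

def skew_angle_degrees(a, b, c, d):
    """How far the transformed y axis leans away from being
    perpendicular to the transformed x axis, in degrees."""
    ux, uy = a, b  # image of the unit x vector
    vx, vy = c, d  # image of the unit y vector
    norm = math.hypot(ux, uy) * math.hypot(vx, vy)
    if norm == 0:
        raise ValueError("degenerate matrix")
    cos_between = (ux * vx + uy * vy) / norm
    cos_between = max(-1.0, min(1.0, cos_between))  # guard against rounding
    # Axes that are 90 degrees apart mean there is no shear.
    return 90.0 - math.degrees(math.acos(cos_between))

def looks_italic(trm, min_degrees=5.0):
    """trm: sequence whose first four values are a, b, c, d.
    min_degrees is an arbitrary threshold (assumption)."""
    a, b, c, d = trm[:4]
    return abs(skew_angle_degrees(a, b, c, d)) >= min_degrees

print(looks_italic((12, 0, 0, 12)))  # upright text
print(looks_italic((12, 0, 4, 12)))  # sheared text
```

Comparing the angle between the two axes, rather than just testing whether the third number is non-zero, keeps the check working for text that is merely rotated.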

JorjMcKie (Maintainer) · 2024-01-20T18:17:44Z

Sorry, no, there is no way to access this.
Including it in the "dict" output would be out of the question anyway. The only potential candidate would be get_texttrace(), which is directly based on the same data that mutool trace ... accesses.

libTorrentUser (Author) · 2024-01-21T19:52:37Z

Oh well, parsing the trace output it is then :)

But I am on the right track, right? I mean, when a PDF file does something like this (fake font styles with fancy rendering techniques), is looking at matrix transformations inside a BT ET block the way to go?

JorjMcKie (Maintainer) · 2024-01-21T20:56:07Z

> Oh well, parsing the trace output it is then :)
>
> But I am on the right track, right? I mean, when a PDF file does something like this (fake font styles with fancy rendering techniques), is looking at matrix transformations inside a BT ET block the way to go?

You could use an XML package and read the trace output.
Although slower and a bit clumsy, it is probably less hacky than digging your way through the whole chain of potential matrices and push / pop instructions ("q"/"Q").
And once you have reached b"BT", you still have the decoding in front of you: you need to find your way back from the glyph numbers you will encounter after the BT to the Unicodes they represent.
I really cannot recommend that approach!
Better to intercept the mutool trace output in one way or another. Maybe you could also look at the source behind get_texttrace() and make your own version of it.
There are still enough problems to solve on that path.
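
A minimal sketch of the XML route, using only the standard library. The sample below is hand-written and simplified: only the span tag and its trm attribute are confirmed in this thread, while the surrounding elements, the g children and their unicode attribute are assumptions about the trace layout, and spans_with_trm is a made-up helper name. Check it against real output before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for trace output (layout is an assumption).
SAMPLE_TRACE = """\
<document>
<page>
<fill_text>
<span font="Serif" trm="12 0 0 12">
<g unicode="H" x="72" y="700"/>
<g unicode="i" x="80" y="700"/>
</span>
<span font="Serif" trm="12 0 4 12">
<g unicode="y" x="90" y="700"/>
<g unicode="o" x="96" y="700"/>
</span>
</fill_text>
</page>
</document>
"""

def spans_with_trm(xml_text):
    """Return a list of (text, trm) pairs, one per span that has a trm."""
    root = ET.fromstring(xml_text)
    result = []
    for span in root.iter("span"):
        raw = span.get("trm")
        if raw is None:
            continue  # span without a matrix: nothing to report
        trm = tuple(float(value) for value in raw.split())
        # Assumed: each glyph child carries its character in "unicode".
        text = "".join(g.get("unicode", "") for g in span.iter("g"))
        result.append((text, trm))
    return result

for text, trm in spans_with_trm(SAMPLE_TRACE):
    print(repr(text), trm)
```

With a real file, the string would come from capturing what mutool trace writes for the document instead of from SAMPLE_TRACE. If the real output turns out not to be strictly well-formed XML, a more tolerant parser would be needed.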

libTorrentUser (Author) · 2024-01-22T20:55:35Z

Thanks for the help. What I ended up doing, at least for now, was to use mupdf directly, the C API I mean. Using it I can create a device and override its fill_text function to access everything I need.

The PDF file I have here seems to be quite simple (if we ignore the way it does Italics :) ), and luckily that means the device's fill_text callbacks are made in reading order. Inside the callback I have access to the fully decoded chars and the matrix, so it should not be too hard.

Ideally, and this is what I plan on doing next, it is probably best to create a stext device, override its fill_text callback and, inside my implementation, first let the original callback do its thing and only then take a look at the matrix and, if it indicates Italic text, put some sort of flag somewhere. The advantage of letting the original callback do its thing is that it will take care of dealing with the character positions and will create proper text blocks, so I won't have to worry in case a PDF file does something weird.

The only problem with this approach is that the callback might (and most of the time will) be called several times until a text block is fully formed, so I'm also going to have to deal with that.
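
To illustrate the shape of that plan without any real library: the sketch below is plain Python, and none of the names in it are MuPDF or pymupdf APIs. BaseTextDevice, its fill_text(chars, matrix) signature, ItalicTaggingDevice and has_shear are all hypothetical stand-ins. It shows only the pattern: let the original callback run first, then inspect the matrix, and merge consecutive calls that share the same style so several callbacks still yield one run.

```python
class BaseTextDevice:
    """Hypothetical stand-in for a text device; not a MuPDF class."""

    def __init__(self):
        self.text = []

    def fill_text(self, chars, matrix):
        # Stands in for the original callback that builds text blocks.
        self.text.append(chars)


class ItalicTaggingDevice(BaseTextDevice):
    """Delegates to the original callback, then records an italic flag."""

    def __init__(self, is_italic):
        super().__init__()
        self._is_italic = is_italic
        self.runs = []  # list of [text, italic_flag]

    def fill_text(self, chars, matrix):
        super().fill_text(chars, matrix)  # original behaviour first
        italic = self._is_italic(matrix)
        if self.runs and self.runs[-1][1] == italic:
            self.runs[-1][0] += chars  # same style as last call: extend
        else:
            self.runs.append([chars, italic])


def has_shear(matrix):
    # Deliberately simplistic (assumption): unrotated text whose
    # third matrix entry is non-zero is treated as italic.
    a, b, c, d = matrix[:4]
    return b == 0 and c != 0


dev = ItalicTaggingDevice(has_shear)
dev.fill_text("Plain ", (12, 0, 0, 12))
dev.fill_text("slan", (12, 0, 4, 12))
dev.fill_text("ted", (12, 0, 4, 12))
dev.fill_text(" text", (12, 0, 0, 12))
print(dev.runs)
```

Merging on consecutive calls is only one way to cope with a block arriving in pieces. Tying the flag back to the blocks that the original callback produces would depend on details of the real device that this sketch does not model.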
