text alignment #3022

shachimaurya23 · 2024-01-12T07:43:32Z

shachimaurya23
Jan 12, 2024

Hi, I have been using PyMuPDF to write translated Arabic text into a PDF, but the text needs to be aligned to the right edge of the bbox. I have tried many approaches, but nothing seems to work. Please help me understand why the text is not right-aligned even after using right-align and a matrix. The paragraph text also sometimes goes outside the bbox or overlaps. The red outline in test_ar.pdf shows the bbox to which the text needs to be right-aligned.
test.pdf
test_ar.pdf

JorjMcKie · 2024-01-12T08:06:28Z

JorjMcKie
Jan 12, 2024
Maintainer

This is a Discussions item, so I am transferring it there for further communication.

0 replies

JorjMcKie · 2024-01-12T08:09:48Z

JorjMcKie
Jan 12, 2024
Maintainer

Which method did you use to write the text? Page.insert_text(), Page.insert_textbox(), TextWriter.append(... right_to_left=True), TextWriter.fill_textbox(... right_to_left=True), Page.insert_htmlbox() ...

8 replies

shachimaurya23 Jan 12, 2024
Author

I reshape the Arabic text, keeping the words together and adjusting the spacing between characters where needed, so that the translated text makes sense.

JorjMcKie Jan 12, 2024
Maintainer

Well, the MuPDF code underneath TextWriter has very limited capabilities for dealing with text in complex scripts like Arabic, Persian, and even more so Devanagari and the like.
This has been overcome - I think completely - with the advent of the Story class.
We have implemented a new textbox output method based on Story that can deal with all languages and scripts. For this it uses the text shaping library HarfBuzz, which makes any other text reshaper package unnecessary.

So may I recommend using it instead of TextWriter? Here is an example:

"""
Output Arabic text using insert_htmlbox.
"""
import fitz

text = """
محافظة الأقصر هي محافظة مصرية تقع في إقليم جنوب الصعيد وتتوزع مراكزها ومدنها على ضفتي نهر النيل وعاصمتها هي مدينة الأقصر التي كانت قديماً تمثل مدينة طيبة عاصمة مصر خلال عدة حقب فرعونية. أنشئت المحافظة طبقاً للقرار الجمهوري رقم 378 لسنة 2009 الصادر في 9 ديسمبر 2009. تمتلك المحافظة مجموعة نادرة من الأماكن الأثرية التي لا يزال الكثير منها محتفظاً بحالته ومن أشهرها: معبد الأقصر، معابد الكرنك، مقابر وادي الملوك، وادي الملكات، المعابد الجنائزية، معبد إسنا وغيرها، بجانب القطع الأثرية الفريدة التي يعرضها متحف الأقصر. ظلت عاصمة المحافظة (طيبة) عاصمة لمصر حتى بداية الأسرة السادسة الفرعونية، حين انتقلت العاصمة إلى (منف) في الشمال. استُلهم اسم المحافظة من اسم عاصمتها الأقصر التي تعددت أسماؤها على مر العصور، فكان أشهرها مدينة المائة باب، مدينة الشمس، مدينة النور، مدينة الصولجان "واست"، وأطلق عليها العرب "الأقصر" لكثرة قصورها "معابدها". تجذب الأقصر شريحة كبيرة من السياحة الوافدة إلى مصر، لما تمتلكه من تراث إنساني ساهم بشكل كبير في ربط الحاضر مع الحضارة المصرية القديمة، واختيرت عاصمةً للسياحة العالمية لعام 2016، وعاصمة للثقافة العربية عام 2017. كان العيد القومي للمحافظة هو يوم صدور قرار تأسيسها في 9 ديسمبر وذلك حتى 20 مايو 2015 الذي وافق فيه مجلس الوزراء على تعديل موعد الاحتفال بالعيد القومي للمحافظة إلى 4 نوفمبر من كل عام، وهو اليوم الذي يوافق ذكرى اكتشاف مقبرة توت عنخ آمون. تقع المحافظة على مساحة 2959.6 كم² بين خطي عرض 36-25 شمالاً، و33-32 شرقاً، وتنقسم إدارياً إلى مدينة الأقصر، مركز البياضية، مركز القرنة، مركز أرمنت، مركز الطود، مركز إسنا، مدينة الأقصر الجديدة، مدينة طيبة الجديدة، ويحدها شمالاً محافظة قنا وجنوباً محافظة أسوان وشرقاً محافظة البحر الأحمر وغرباً محافظة الوادي الجديد. يسيطر على الأقصر مناخ صحراوي جاف قليل الأمطار إلا أنه أحياناً تسقط الأمطار بغزارة على التلال والجبال بالصحراء الشرقية في أوقات عدم استقرار الجو مسببة ما يعرف بالسيول ويختلف المناخ مع تعاقب الفصول الأربعة فيتميز الشتاء والصيف باستقرار الأحوال الجوية ويغلب على الربيع والخريف الأحوال الجوية غير المستقرة والرياح الشديدة المحملة بالأتربة.
"""
text = text.replace("\n", "<br>")  # line breaks must be adjusted
doc = fitz.open()
page = doc.new_page()
rect = (50, 50, 540, 500)
page.insert_htmlbox(rect, text, css="* {text-align: right}")
doc.save(__file__.replace(".py", ".pdf"))

Delivers this:

Please note that the text variable can also contain any HTML tags, including CSS-like styling instructions.
Alternatively, or in addition, you can supply styling instructions via the css parameter, as shown above, where it causes the "global" right alignment.
In the same way, your text can be any mixture of arbitrary text pieces, each with its own alignment.
Rotation by a multiple of 90° is supported, as is automatic downscaling of all content if it doesn't fit in the supplied rectangle.
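
Because the text is interpreted as HTML, plain text that happens to contain characters like < or & should probably be escaped before its line breaks are converted. A minimal sketch using only the standard library; the helper name plain_to_html is made up for illustration and is not part of PyMuPDF:

```python
import html


def plain_to_html(text: str) -> str:
    """Turn plain text into an HTML fragment (hypothetical helper)."""
    # Drop leading / trailing newlines, then turn &, < and > into entities.
    escaped = html.escape(text.strip("\n"), quote=False)
    # Keep the remaining line breaks of the original text.
    return escaped.replace("\n", "<br>")


fragment = plain_to_html("a < b & c\nsecond line")
print(fragment)
```

The result could then be passed in place of text in the example above. Text that intentionally contains HTML tags must of course not be escaped this way.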

There is simply no hope of ever doing this properly with TextWriter, as it will always be an output method based on single characters / Unicode code points.

Here is the documentation which also points to examples.
If you use the newest PyMuPDF version 1.23.9, the method is readily available. If you cannot upgrade, you must use import fitz_new as fitz to get access to method Page.insert_htmlbox.

shachimaurya23 Jan 12, 2024
Author

Thank you for the help. I will try it and let you know how it goes.

JorjMcKie Jan 12, 2024
Maintainer

Great. If you don't do that already: you should make sure to use doc.subset_fonts() before saving (which should include compression and garbage collection).

And if you are OK with the standard / Google Noto fonts, you will never need to fiddle around with supplying fonts. But of course you can if needed.

shachimaurya23 Jan 12, 2024
Author

It works like magic. Thank you so much for telling me about insert_htmlbox.

shachimaurya23 · 2024-01-16T11:24:11Z

shachimaurya23
Jan 16, 2024
Author

RTL does not work with insert_htmlbox() when a sentence mixes Arabic (or any other RTL language) with Latin (English) text.
For example, if the text is
A B C [space] ג ב א [space] D E F

then as per RTL it should have been
DEF אבג ABC

but what it actually writes is
ABC אבג DEF

5 replies

JorjMcKie Jan 16, 2024
Maintainer

then as per RTL it should have been
DEF אבג ABC

Why?
If you have a mixture of ltr and rtl text pieces, it is not a priori clear what should be done.
To me, both alternatives look possible.

shachimaurya23 Jan 16, 2024
Author

But when we use TextWriter.append(... right_to_left=True) or TextWriter.fill_textbox(... right_to_left=True), the RTL output generated for the text is
DEF אבג ABC

which in my opinion is better than the other way round, where only the Arabic text is RTL: the reader would read the whole sentence right to left, so this makes more sense.

shachimaurya23 Jan 16, 2024
Author

Is there any way to get

DEF אבג ABC

output with right-aligned text using PyMuPDF?

JorjMcKie Jan 16, 2024
Maintainer

As you indicate yourself:
If a text mixes ltr and rtl, there is no way to determine programmatically whether the complete string should be treated as rtl or ltr. It could be English text with some Hebrew, or the other way round.
In this case you must feed this information into the process, as you do when using TextWriter by specifying the respective parameter.
The only recommendation I currently have is to split the text string at space characters and then join it again with intervening spaces in reversed order, like this:

text_list = text.split(" ")
rtl_text = " ".join(reversed(text_list))

shachimaurya23 Jan 17, 2024
Author

But this will also affect the Arabic or Hebrew text when it contains spaces between words.

JorjMcKie · 2024-01-17T08:50:45Z

JorjMcKie
Jan 17, 2024
Maintainer

But this will also affect the Arabic or Hebrew text when it contains spaces between words.

True, the algorithm is a bit more complex, because you first have to undo the word splitting for any two adjacent RTL words. I hope the following does this:

def is_rtl(word):  # check if a word contains characters from an RTL alphabet
    r0 = range(0x0590, 0x0780)  # Hebrew / Arabic / Persian Unicode range
    return any(ord(c) in r0 for c in word)

words = text.split(" ")
new_words = [words[0]]  # start with the first word

for w in words[1:]:  # the first word is already in new_words
    w0 = new_words[-1]  # previous word (or merged run of RTL words)
    if is_rtl(w0) and is_rtl(w):  # both words are RTL
        new_words[-1] = w0 + " " + w  # merge into the previous entry
        continue
    new_words.append(w)

rtl_text = " ".join(reversed(new_words))
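
For reference, the same idea as a self-contained function. This is only a sketch: the names has_rtl and reverse_for_rtl are invented here, and instead of a fixed code point range it checks the bidirectional class reported by the standard unicodedata module (classes R and AL), which is an assumption about what should count as RTL. It only reorders whole words and is not an implementation of the Unicode bidirectional algorithm.

```python
import unicodedata


def has_rtl(word: str) -> bool:
    """True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(c) in ("R", "AL") for c in word)


def reverse_for_rtl(text: str) -> str:
    """Reverse the word order, keeping runs of adjacent RTL words together."""
    chunks = []
    for word in text.split(" "):
        if chunks and has_rtl(chunks[-1]) and has_rtl(word):
            chunks[-1] += " " + word  # extend the current RTL run
        else:
            chunks.append(word)  # start a new chunk
    return " ".join(reversed(chunks))


# Two adjacent Hebrew words between two Latin words.
sample = "ABC \u05d0\u05d1\u05d2 \u05d3\u05d4 DEF"
print(reverse_for_rtl(sample))
```

In this sample the two Latin words swap places, while the two Hebrew words stay together in their original order.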

0 replies

shachimaurya23 · 2024-01-17T11:11:55Z

shachimaurya23
Jan 17, 2024
Author

Here insert_htmlbox() fails to keep the text within the bbox. It does not scale the text to the bbox size. Why is that?

3 replies

JorjMcKie Jan 17, 2024
Maintainer

Please show me the source code.

shachimaurya23 Jan 17, 2024
Author

it is the same
start_x = bbox.left
start_y = bbox.top
rect = fitz.Rect(start_x, start_y, start_x + bbox.width, start_y + bbox.height)
pdf_page.insert_htmlbox(rect, text, rotate=rotate_value, scale_low=0)

JorjMcKie Jan 17, 2024
Maintainer

I know what it is: if the last word of a line is longer than the rectangle width, this is unfortunately not yet detected by the underlying Story object.
We are investigating options to tackle this.
There is a way to allow the Story to use additional line breaks via soft hyphens, as in the following demo:

import fitz

r1 = (100, 100, 120, 120)
r2 = (100, 130, 120, 150)
text1 = "short treatment"
text2 = "short treat&shy;ment"
doc = fitz.open()
page = doc.new_page()
page.insert_htmlbox(r1, text1)
page.insert_htmlbox(r2, text2)
doc.save("short.pdf")
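
If the text is generated (for example by a translation step), the soft hyphens could be inserted by a small helper. A rough sketch, assuming that breaking a word after a fixed number of characters is acceptable; add_soft_hyphens and its max_len parameter are hypothetical, and the break positions are purely mechanical, not real hyphenation:

```python
def add_soft_hyphens(text: str, max_len: int = 5) -> str:
    """Insert "&shy;" after every max_len characters of longer words."""
    out = []
    for word in text.split(" "):
        # Cut the word into pieces of at most max_len characters.
        pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
        out.append("&shy;".join(pieces))
    return " ".join(out)


print(add_soft_hyphens("short treatment"))  # prints: short treat&shy;ment
```

With the default max_len this reproduces text2 from the demo. The helper does not look at HTML tags or entities, so it should only be applied to plain text.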

JorjMcKie · 2024-01-17T19:33:04Z

JorjMcKie
Jan 17, 2024
Maintainer

A hint from the MuPDF colleagues:
This is the interim solution until changes are made to the Story code. It can be used with or without a soft hyphen.

import fitz

r1 = (100, 100, 120, 120)  # very small rectangles of equal size
r2 = (100, 130, 120, 150)
text1 = "short treatment"  # some word will not fit in rect width
text2 = "short treat&shy;ment"  # offer optional word break via soft hyphen
doc = fitz.open()
page = doc.new_page()
page.insert_htmlbox(r1, text1, css="* {overflow-wrap: break-word;}")
page.insert_htmlbox(r2, text2)
page.draw_rect(r1, color=(1, 0, 0))
page.draw_rect(r2, color=(1, 0, 0))
doc.save(__file__.replace(".py", ".pdf"))

Result:

0 replies

JorjMcKie · 2024-01-17T21:47:04Z

JorjMcKie
Jan 17, 2024
Maintainer

Good news, I think:

The MuPDF team has developed a solution for the issue discussed here. With one of the next PyMuPDF versions, you will be able to choose between 3 different ways to solve this:

import fitz

RED = fitz.pdfcolor["red"]
r1 = (100, 100, 120, 120)
r2 = (100, 130, 120, 150)
r3 = (100, 160, 120, 180)
text1 = "short treatment"
text2 = "short treat&shy;ment"
text3 = text1
doc = fitz.open()
page = doc.new_page()
page.insert_htmlbox(r1, text1, css="* {overflow-wrap: break-word;}")
page.insert_htmlbox(r2, text2)
page.insert_htmlbox(r3, text3)  # NEW behavior
page.draw_rect(r1, color=RED)
page.draw_rect(r2, color=RED)
page.draw_rect(r3, color=RED)
doc.save(__file__.replace(".py", ".pdf"))

The last rectangle shows the new possibility:

2 replies

shachimaurya23 Jan 18, 2024
Author

How do I use the 3rd way without changing the rectangle dimensions for now? Do I need to wait for the new version to come out to use it? I am currently using v1.23.9.

JorjMcKie Jan 18, 2024
Maintainer

Yes, to use the 3rd way, we have to wait for a change in MuPDF - PyMuPDF cannot address this.
