Description
Feel free to close this if there's nothing to be done about it. I saw that Annex 29 (UAX #29) repeatedly excludes the zero-width non-joiner (ZWNJ) as a boundary.
In Persian, this character (U+200C) is used to prevent connection of letters between certain prefixes and suffixes, and the words to which they are attached. I know it has other purposes in other languages, but Persian is what I'm working with. (I also work in Arabic, where the ZWNJ is not used in any context that I know of.)
I was tinkering with a Rust program that involves (among other things) taking Arabic or Persian text input and segmenting the graphemes. Once I found this package, it worked immediately, with few exceptions. And I understood the exceptions that occurred. For example, if an Arabic letter is followed by a vowel mark or other diacritic, those code points stay together as a unit. That seems right, since the letter plus diacritic(s) can be said to represent the "user-perceived character."
But I have a problem with the ZWNJ in Persian. It does not create a new "user-perceived character" along with the preceding letter, which is how this segmentation scheme treats it. Rather, the intention is, "act as though there's a space after this letter, but leave out the space."
At issue is the fact that letters in the Arabic or Persian alphabet have up to four contextual forms: isolated, initial, medial, and final. As you probably know, setting the correct form in a given context tends to be taken care of by the shaping engine. (Otherwise, typing would be incredibly tedious.) When a ZWNJ is added, it's an instruction not to join the preceding letter to the one that follows, so the initial or medial form is not used where it otherwise would be. The result is that the isolated or final form is set instead, depending on the context.
When segmenting graphemes in Persian, then, I don't think it makes sense to exclude the ZWNJ as a boundary. It would better be segmented out, the way that spaces are. In fact, unless I've missed something, U+200C could be treated as a grapheme boundary when it occurs after any code point in the Arabic block. (It should not, however, be treated as a word or sentence boundary by default.)
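To make the suggestion concrete, here is a rough sketch of the behavior I have in mind, written as a post-processing step over clusters that some segmenter has already produced. It uses only the standard library. The names `split_zwnj` and `in_arabic_block` are hypothetical and not part of this package, and I'm assuming "the Arabic block" means U+0600 through U+06FF only.

```rust
/// Zero-width non-joiner.
const ZWNJ: char = '\u{200C}';

/// Assumption: "the Arabic block" is taken to be U+0600..=U+06FF only.
fn in_arabic_block(c: char) -> bool {
    ('\u{0600}'..='\u{06FF}').contains(&c)
}

/// Hypothetical helper: given grapheme clusters from an existing segmenter,
/// split every ZWNJ that directly follows an Arabic-block code point into
/// its own segment. Everything else is passed through unchanged.
fn split_zwnj<'a>(clusters: &[&'a str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    for &cluster in clusters {
        let mut start = 0; // byte offset where the pending segment begins
        let mut prev: Option<char> = None;
        for (i, c) in cluster.char_indices() {
            if c == ZWNJ && prev.map_or(false, in_arabic_block) {
                // Emit whatever precedes the ZWNJ, then the ZWNJ itself.
                if start < i {
                    out.push(&cluster[start..i]);
                }
                let end = i + c.len_utf8();
                out.push(&cluster[i..end]);
                start = end;
            }
            prev = Some(c);
        }
        // Emit any remainder of the cluster.
        if start < cluster.len() {
            out.push(&cluster[start..]);
        }
    }
    out
}
```

Given the clusters `["\u{0645}", "\u{06CC}\u{200C}", "\u{062E}"]` (the start of a word like می‌خواهم), this returns the yeh and the ZWNJ as separate segments, while a ZWNJ after a non-Arabic code point stays attached. Whether something like this belongs in the segmenter itself or in caller code is exactly the question I'm raising.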
But I could be wrong. There are people who would know better. And if the mandate here is to follow Annex 29 faithfully, then I suppose it doesn't matter. I found a workaround for my immediate purposes.
Thank you for your work on this project!