Arabic: 'source' Unicode to final display

Posted 2019-03-21 12:29

simple question:

this is the final display string I am looking for

لعبة ديدة

now below is each of the separate characters, before being 'glued' together (so I've put a space between each of them to stop the joining)

ل ع ب ة د ي د ة

note how they are NOT the same characters, there is some magical transform that melds them together and converts them to new Unicode characters.

and then in that above, the characters are actually appearing right to left (in memory, they are left to right)

so my simple question is this: where do I get a platform independent c/c++ function that will take my source 16 bit Unicode string, and do the transform on it to result in the Unicode string that will create the one first quoted above? doing the RTL conversion, and the joining?

that's all I want, one function that does that.

UPDATE:

ok, yes, I know that the 'characters' are the same in the two above examples, they are the same 'letters' but (viewing in chrome, or latest IE) anyone can CLEARLY see that the glyphs are different. now I'm fairly confident that this transform that needs to be done can be done on the unicode level, because my font file, and the unicode standard, seems to specify the different glyphs for both the separate, and various joined versions of the characters/letters. (unicode.org/charts/PDF/UFB50.pdf unicode.org/charts/PDF/UFE70.pdf)

so, can I just put my unicode into a function and get the transformed unicode out?

Tags: c++ c arabic
5 answers
小情绪 Triste *
Reply #2 · 2019-03-21 13:07

The joining and RTL conversion don't happen at the level of Unicode characters.

In other words: neither the order of the characters nor the actual Unicode code points are changed during this process.

In fact, the joining and the RTL/LTR transitions are both handled by the text rendering engine.

This quote from the Wikipedia article on the Arabic alphabet explains it quite nicely:

Finally, the Unicode encoding of Arabic is in logical order, that is, the characters are entered, and stored in computer memory, in the order that they are written and pronounced without worrying about the direction in which they will be displayed on paper or on the screen. Again, it is left to the rendering engine to present the characters in the correct direction, using Unicode's bi-directional text features. In this regard, if the Arabic words on this page are written left to right, it is an indication that the Unicode rendering engine used to display them is out-of-date.
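To make the point about stored code points concrete, here is a minimal sketch using the two strings from the question, written as escapes so the source file encoding does not matter. The helper names (`strip_spaces`, `same_letters`) are made up for illustration. Once the spaces are removed, both strings contain exactly the same code points in the same logical order; only the renderer's output differs.

```cpp
#include <algorithm>
#include <string>

// Remove U+0020 SPACE so that only the letters remain.
std::u16string strip_spaces(std::u16string s) {
    s.erase(std::remove(s.begin(), s.end(), u' '), s.end());
    return s;
}

// The question's two strings, both in logical (typing) order.
// "joined" is the string that displays as connected words;
// "separated" has a space between every letter to stop the joining.
const std::u16string joined =
    u"\u0644\u0639\u0628\u0629 \u062F\u064A\u062F\u0629";
const std::u16string separated =
    u"\u0644 \u0639 \u0628 \u0629 \u062F \u064A \u062F \u0629";

// True if both strings hold the same letters in the same order.
bool same_letters() {
    return strip_spaces(joined) == strip_spaces(separated);
}
```

Note that the first element in memory is U+0644 (LAM), the letter that appears rightmost on screen.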

别忘想泡老子
Reply #4 · 2019-03-21 13:24

I realise this is an old question, but what you're looking for is FriBidi, the GNU implementation of the Unicode bidirectional algorithm.

This program does the glyph selection that was asked about in the question, as well as handling bidirectional text (mixture of right-to-left and left-to-right text).

Rolldiameter
Reply #5 · 2019-03-21 13:29

The processing you're looking for is usually called shaping: choosing the joined (contextual) form of each letter, including ligatures. Unlike most Latin-script languages, where you can simply put one character after another to render the text, joined forms are fundamental in Arabic. The substitution is done in the text rendering engine, and the shaping information is generally stored in the font files.

note how they are NOT the same characters

They are the same for an Arabic reader. It is still readable. There is no transform to do on your UTF-16 source text. You must provide the whole string to your text renderer. In C/C++, and since you want to stay platform independent, you can use Pango for rendering.

Note: Perhaps you wanted to write لعبة جديدة (i.e. new game)? What you give as an example has no meaning in Arabic.

狗以群分
Reply #6 · 2019-03-21 13:30

What you are looking for is an Arabic script synthesis algorithm. I'm not aware of one that exists as open source. If you find one, please post it.

Some points:

At the storage level, there is no Unicode transform. There is an abstract representation of the string, as pointed out by other answers.

At the rendering level, you could choose to use Unicode Presentation Forms, but you could also choose to use other forms. Unicode Presentation Forms are not a standard for what the presentation output encoding should be. Rather, they are just one example of presentation codes that can be output by the rendering engine using script synthesis.

To make it clearer: there isn't a single standard transform (i.e. synthesis algorithm) from A to B, where A is the standard Unicode Arabic page and B is the standard Unicode Arabic Presentation Forms. Rather, there are different transformations that vary in complexity and can use different encoding systems for B, and one of the encodings that can be used for B is the Unicode Presentation Forms. For example, a simple typewriter style would require only a simple rendering algorithm that does not need Presentation Forms. Indeed, there do exist modern writing styles (not in common usage, though) where A and B are actually identical, and only a different font page is used to do the rendering. On the other hand, the transform needed to render typesetting or traditional calligraphic forms is more complex and requires something similar to the Unicode Presentation Forms.
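As a rough illustration of one such transform from A to B, here is a minimal sketch that maps logical-order Arabic letters to Arabic Presentation Forms-B code points. It is a toy under stated assumptions, not a real shaper: the table covers only the six letters in the question's example, it ignores lam-alef ligatures, diacritics and joining controls, and the names `Forms`, `table`, `shape` and `to_visual_naive` are hypothetical. The final reversal step is only valid for a string that is entirely right-to-left; it is not the Unicode Bidirectional Algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>

// Joining behaviour of a letter (a simplified subset of Unicode joining types).
enum class Join { None, Right, Dual };

struct Forms {
    Join join;
    char16_t isolated, final_form, initial, medial;  // 0 where no such form exists
};

// Hypothetical table covering only the letters in the question's example.
// Values are taken from the Arabic Presentation Forms-B block.
const std::map<char16_t, Forms>& table() {
    static const std::map<char16_t, Forms> t = {
        {0x0628, {Join::Dual,  0xFE8F, 0xFE90, 0xFE91, 0xFE92}},  // BEH
        {0x0629, {Join::Right, 0xFE93, 0xFE94, 0,      0     }},  // TEH MARBUTA
        {0x062F, {Join::Right, 0xFEA9, 0xFEAA, 0,      0     }},  // DAL
        {0x0639, {Join::Dual,  0xFEC9, 0xFECA, 0xFECB, 0xFECC}},  // AIN
        {0x0644, {Join::Dual,  0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0}},  // LAM
        {0x064A, {Join::Dual,  0xFEF1, 0xFEF2, 0xFEF3, 0xFEF4}},  // YEH
    };
    return t;
}

Join join_type(char16_t c) {
    auto it = table().find(c);
    return it == table().end() ? Join::None : it->second.join;
}

// Replace each known letter by its contextual presentation form.
// Input and output are both in logical order.
std::u16string shape(const std::u16string& logical) {
    std::u16string out = logical;
    for (std::size_t i = 0; i < logical.size(); ++i) {
        auto it = table().find(logical[i]);
        if (it == table().end()) continue;  // spaces and unknown characters stay as-is
        const Forms& f = it->second;
        // Connects to the previous letter only if that letter can join forward.
        bool prev = i > 0 && join_type(logical[i - 1]) == Join::Dual;
        // Connects to the next letter only if this one can join forward
        // and the next one can join backward.
        bool next = f.join == Join::Dual && i + 1 < logical.size() &&
                    join_type(logical[i + 1]) != Join::None;
        out[i] = prev ? (next ? f.medial : f.final_form)
                      : (next ? f.initial : f.isolated);
    }
    return out;
}

// Naive visual order for an all-RTL string: simply reverse it.
std::u16string to_visual_naive(std::u16string shaped) {
    std::reverse(shaped.begin(), shaped.end());
    return shaped;
}
```

For the question's string, `shape` picks initial LAM, medial AIN, medial BEH and final TEH MARBUTA for the first word, which is the "different glyphs" effect the question describes. A real engine normally gets the same result by selecting glyphs inside the font rather than by rewriting the text into presentation-form code points.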

Here are a couple of pointers for more information on the topic:
