Fully correct Unicode visual string reversal

Posted 2019-07-18 04:28

[Inspired largely by trying to explain the problems with Character Encoding independent character swap, but also by these other questions, neither of which contains a complete answer: How to reverse a Unicode string, How to get a reversed String (unicode safe)]

Doing a visual string reversal in Unicode is much harder than it looks. In any storage format other than UTF-32 you have to pay attention to code point boundaries rather than going byte-by-byte. But that's not good enough, because of combining characters; the spec has a concept of "grapheme cluster" that's closer to the basic unit you want to be reversing. But that's still not good enough; there are all sorts of special-case characters, like bidi overrides and final forms, that will have to be fixed up.
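To make the first two failure modes concrete, here is a minimal Python sketch. The helper `approx_clusters` is a hypothetical name introduced for illustration; it only groups a base character with the combining marks that follow it, which is a rough approximation of the grapheme cluster rules in UAX #29 and not a full implementation of them.

```python
import unicodedata


def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    A rough stand-in for UAX #29 grapheme clusters, not a full implementation.
    """
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # attach the mark to the preceding base character
        else:
            out.append(ch)
    return out


s = "cafe\u0301"  # "café" spelled with U+0301 COMBINING ACUTE ACCENT

# Reversing UTF-8 bytes puts a continuation byte first, which is not valid UTF-8.
try:
    s.encode("utf-8")[::-1].decode("utf-8")
    bytes_ok = True
except UnicodeDecodeError:
    bytes_ok = False

naive = s[::-1]  # code point by code point
by_cluster = "".join(reversed(approx_clusters(s)))  # cluster by cluster
```

Here `bytes_ok` ends up `False`, `naive` starts with the detached accent (`"\u0301efac"`), and `by_cluster` keeps the accent on its letter (`"e\u0301fac"`).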

This pseudo-algorithm handles all the easy cases I know about:

1. Segment the string into an alternating list of words and word-separators (some word-separators may be the empty string).
2. Reverse the order of this list.
3. For each string in the list:
   a. Segment the string into grapheme clusters.
   b. Reverse the order of the grapheme clusters.
   c. Check the initial and final cluster in the reversed sequence; their base characters may need to be reassigned to the correct form (e.g. if U+05DB HEBREW LETTER KAF is now at the end of the sequence it needs to become U+05DA HEBREW LETTER FINAL KAF, and vice versa).
   d. Join the sequence back into a string.
4. Recombine the list of reversed words to produce the final reversed string.
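The steps above can be sketched in Python roughly as follows. This is a sketch under several assumptions, not a complete solution: words are split on whitespace only (so punctuation stays attached to its word), grapheme clusters are approximated as a base character plus trailing combining marks, and only the five Hebrew final-form pairs are handled. The names `clusters`, `fix_form`, `reverse_word` and `visual_reverse` are made up for this example, and bidi overrides are not handled at all.

```python
import re
import unicodedata

# Hebrew letters that have a distinct word-final form (regular -> final).
TO_FINAL = {
    "\u05DB": "\u05DA",  # KAF -> FINAL KAF
    "\u05DE": "\u05DD",  # MEM -> FINAL MEM
    "\u05E0": "\u05DF",  # NUN -> FINAL NUN
    "\u05E4": "\u05E3",  # PE -> FINAL PE
    "\u05E6": "\u05E5",  # TSADI -> FINAL TSADI
}
FROM_FINAL = {final: regular for regular, final in TO_FINAL.items()}


def clusters(text):
    """Approximate grapheme clusters: a base character plus trailing marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def fix_form(cluster, table):
    """Swap the base character of a cluster if the table has an entry for it."""
    return table.get(cluster[0], cluster[0]) + cluster[1:]


def reverse_word(word):
    """Steps 3a-3d: reverse the clusters, then fix the first and last forms."""
    parts = clusters(word)[::-1]
    if len(parts) > 1:  # assumption: a one-cluster word is left unchanged
        parts[0] = fix_form(parts[0], FROM_FINAL)  # no longer word-final
        parts[-1] = fix_form(parts[-1], TO_FINAL)  # now word-final
    return "".join(parts)


def visual_reverse(text):
    """Steps 1, 2 and 4: split into words and separators, reverse, rejoin."""
    pieces = re.split(r"(\s+)", text)  # capturing group keeps the separators
    return "".join(reverse_word(p) for p in reversed(pieces))
```

For example, `visual_reverse("cafe\u0301 au lait")` gives `"tial ua e\u0301fac"`, and `visual_reverse("\u05DE\u05DC\u05DA")` (mem, lamed, final kaf) gives `"\u05DB\u05DC\u05DD"` (kaf, lamed, final mem).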

... But it doesn't handle bidi overrides and I'm sure there's stuff I don't know about, as well. Can anyone fill in the gaps?

Tags: unicode language-agnostic

0 answers
