Fully correct Unicode visual string reversal

Posted 2019-07-18 04:28

[Inspired largely by trying to explain the problems with "Character Encoding independent character swap", but also by these other questions, neither of which contains a complete answer: "How to reverse a Unicode string" and "How to get a reversed String (unicode safe)".]

Doing a visual string reversal in Unicode is much harder than it looks. In any storage format other than UTF-32 you have to pay attention to code point boundaries rather than going byte-by-byte (or code unit by code unit). But that's not good enough, because of combining characters; the spec has a concept of "grapheme cluster" that's closer to the basic unit you want to be reversing. But that's still not good enough; there are all sorts of special-case characters, like bidi overrides and final forms, that will have to be fixed up.
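
To make the first two failure modes concrete, here is a minimal Python illustration. The sample string and the helper name are my own, chosen only for illustration. Python's `str` is indexed by code point, so `s[::-1]` is a code-point-wise reversal.

```python
# "née" spelled with a decomposed accent: n, e, U+0301 COMBINING ACUTE ACCENT, e
s = "ne\u0301e"


def utf8_byte_reversal_is_valid(text: str) -> bool:
    """Reverse the UTF-8 bytes and report whether the result still decodes."""
    try:
        text.encode("utf-8")[::-1].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Byte-wise reversal splits the two-byte encoding of U+0301, so the
# result is not even valid UTF-8.
byte_reversal_ok = utf8_byte_reversal_is_valid(s)

# Code-point-wise reversal is valid Unicode, but the accent now follows
# (and so attaches to) the other "e".
reversed_by_code_point = s[::-1]

# What a visual reversal should produce: e, e + accent, n
wanted = "ee\u0301n"
```

Here `byte_reversal_ok` is false, and `reversed_by_code_point` is `"e\u0301en"` rather than `wanted`, which is why the unit of reversal has to be something larger than a code point.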

This pseudo-algorithm handles all the easy cases I know about:

  1. Segment the string into an alternating list of words and word-separators (some word-separators may be the empty string)
  2. Reverse the order of this list.
  3. For each string in the list:
    1. Segment the string into grapheme clusters.
    2. Reverse the order of the grapheme clusters.
    3. Check the initial and final cluster in the reversed sequence; their base characters may need to be reassigned to the correct form (e.g. if U+05DB HEBREW LETTER KAF is now at the end of the sequence, it needs to become U+05DA HEBREW LETTER FINAL KAF, and vice versa).
    4. Join the sequence back into a string.
  4. Recombine the list of reversed words to produce the final reversed string.

... But it doesn't handle bidi overrides and I'm sure there's stuff I don't know about, as well. Can anyone fill in the gaps?
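
For concreteness, the pseudo-algorithm above might be sketched in Python roughly as follows. This is a sketch under several loud assumptions, not a complete implementation: words are taken to be runs of non-whitespace (so punctuation stays attached to its word), a grapheme cluster is approximated as a base character followed by any characters in the Unicode mark categories (this is not the full UAX #29 segmentation), and only the five Hebrew final-form pairs are fixed up. All function and table names are hypothetical. Bidi overrides are still not handled.

```python
import re
import unicodedata

# Hebrew letters with a distinct word-final form (regular -> final).
# Assumption: only Hebrew is covered; other scripts need their own tables.
TO_FINAL = {
    "\u05DB": "\u05DA",  # KAF -> FINAL KAF
    "\u05DE": "\u05DD",  # MEM -> FINAL MEM
    "\u05E0": "\u05DF",  # NUN -> FINAL NUN
    "\u05E4": "\u05E3",  # PE -> FINAL PE
    "\u05E6": "\u05E5",  # TSADI -> FINAL TSADI
}
FROM_FINAL = {final: regular for regular, final in TO_FINAL.items()}


def approx_clusters(word):
    """Approximate grapheme clusters: a base character plus trailing marks."""
    clusters = []
    for ch in word:
        # Categories Mn, Mc and Me are the combining/enclosing marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def fix_form(cluster, is_last):
    """Swap the base character to its final or regular form as needed."""
    base, marks = cluster[0], cluster[1:]
    table = TO_FINAL if is_last else FROM_FINAL
    return table.get(base, base) + marks


def reverse_word(word):
    clusters = approx_clusters(word)[::-1]       # steps 3.1 and 3.2
    if len(clusters) < 2:
        return word                              # nothing moved, nothing to fix
    clusters[0] = fix_form(clusters[0], is_last=False)    # step 3.3
    clusters[-1] = fix_form(clusters[-1], is_last=True)
    return "".join(clusters)                     # step 3.4


def reverse_visual(text):
    # Step 1: the capturing group keeps the separators in the list,
    # giving alternating words and whitespace runs (words may be empty).
    parts = re.split(r"(\s+)", text)
    # Steps 2 to 4: reverse the list, reverse each piece, and rejoin.
    return "".join(reverse_word(part) for part in reversed(parts))
```

With these definitions, `reverse_visual("ne\u0301e abc")` gives `"cba ee\u0301n"`, and the Hebrew word `"\u05DE\u05DC\u05DA"` (MEM, LAMED, FINAL KAF) becomes `"\u05DB\u05DC\u05DD"` (KAF, LAMED, FINAL MEM). Blindly forcing a final form at the end of every reversed word is itself a simplification, so treat the table lookup as a placeholder for whatever script-specific rules turn out to be needed.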
