[Inspired largely by trying to explain the problems with Character Encoding independent character swap, but also these other questions neither of which contain a complete answer: How to reverse a Unicode string, How to get a reversed String (unicode safe)]
Doing a visual string reversal in Unicode is much harder than it looks. In any storage format other than UTF-32 you have to pay attention to codepoint boundaries rather than going byte-by-byte. But that's not good enough, because of combining glyphs; the spec has a concept of "grapheme cluster" that's closer to the basic unit you want to be reversing. But that's still not good enough; there are all sorts of special case characters, like bidi overrides and final forms, that will have to be fixed up.
This pseudo-algorithm handles all the easy cases I know about:
- Segment the string into an alternating list of words and word-separators (some word-separators may be the empty string)
- Reverse the order of this list.
- For each string in the list:
- Segment the string into grapheme clusters.
- Reverse the order of the grapheme clusters.
- Check the initial and final cluster in the reversed sequence; their base characters may need to be reassigned to the correct form (e.g. if U+05DB HEBREW LETTER KAF is now at the end of the sequence it needs to become U+05DA HEBREW LETTER FINAL KAF, and vice versa)
- Join the sequence back into a string.
- Recombine the list of reversed words to produce the final reversed string.
... But it doesn't handle bidi overrides and I'm sure there's stuff I don't know about, as well. Can anyone fill in the gaps?