Big unicode problems - AS3

Posted 2019-07-30 03:58

Question:

I made a program where people can type in 4 characters (a hex code) and it inserts the corresponding Unicode character into a TextFlow element. I had a lot of problems with this, but in the end I succeeded with some help. Then a new problem came up when I typed "dddd" or "ddd1" as a test.

I got the error - "An unpaired Unicode surrogate was encountered in the input."

I spent about 2 days testing for that, and there was absolutely no event being triggered that would let me test for the error before it occurred.

The code:

var str:String = "dddd";
var num:uint = parseInt(str, 16);           // 0xDDDD, a lone low surrogate
var res:String = String.fromCharCode(num);

Actually, when the error occurs, res shows up as "?" in the console, but if you test for it with if (res == "?") it returns false.

MY QUESTION: I searched and searched and found absolutely no description of this error in Adobe's AS3 reference, but after 2 days I found this page for JavaScript: http://scripts.sil.org/cms/scripts/page.php?item_id=IWS-Chapter04a

It says that - The code units in the range 0xD800–0xDFFF, serve a special purpose, however. These code units, known as surrogate code units

So now I test with:

if ((num > 0 && num < 0xD800) || (num > 0xDFFF && num < 0xFFFF)) {
    // num is outside the surrogate range: get the Unicode character
    res = String.fromCharCode(num);
}

My question is simply whether I understood this correctly: will this check actually prevent the error from occurring? I'm no Unicode specialist and don't really know how to test for it. There are tens of thousands of characters, so I might have missed one, and that would mean users could hit the error by accident and risk crashing the application.

Answer 1:

You are correct. A code point between 0xD800 and 0xDBFF (a "high surrogate") must be paired with a code point between 0xDC00 and 0xDFFF (a "low surrogate"). Those are reserved for use in UTF-16[1], which needs them to address the higher planes that don't fit in 16 bits, and hence those code points can't appear on their own. For example:

0xD802 DC01 corresponds to (I'll leave out the 0x hex markers):

  10000 + (high - D800) * 0400 + (low  - DC00)
  10000 + (D802 - D800) * 0400 + (DC01 - DC00) 
= 10000 +         0002  * 0400 +         0001 
= 10801, i.e. the code point U+10801 (D802 DC01 is how UTF-16 expresses it)

... just adding that bit of info in case you later need to support it.
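In AS3 the arithmetic above could look roughly like this. It is only a sketch: the function names decodeSurrogatePair and encodeSurrogatePair are made up for illustration (they are not part of the AS3 API), and the code assumes it runs as a frame script or inside a class where trace() is available.

```actionscript
// Hypothetical helper: combine a high and a low surrogate into the
// code point they encode (same formula as the worked example above).
function decodeSurrogatePair(high:uint, low:uint):uint {
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
}

// Hypothetical helper: split a code point above 0xFFFF into its two
// UTF-16 code units and return them as a String.
function encodeSurrogatePair(codePoint:uint):String {
    var offset:uint = codePoint - 0x10000;
    var high:uint = 0xD800 + (offset >> 10);    // upper 10 bits of the offset
    var low:uint  = 0xDC00 + (offset & 0x3FF);  // lower 10 bits of the offset
    return String.fromCharCode(high, low);
}

trace(decodeSurrogatePair(0xD802, 0xDC01).toString(16)); // 10801
var pair:String = encodeSurrogatePair(0x10801);          // the code units D802 DC01
```

Because both halves are passed to String.fromCharCode in one call, the resulting string never contains a surrogate on its own.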

I haven't tested the AS3 functionality for the following, but you may want to also test the input below - you won't get the surrogate error for these, but might get another error message:

  • 0xFFFE and 0xFFFF (when using higher planes, also any code point "ending" with those bits, e.g. 0x1FFFE and 0x1FFFF; 0x2FFFE and 0x2FFFF etc.) Those are "non-characters".
  • The same goes for 0xFDD0-0xFDEF - also "non-characters".
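Putting the surrogate range and the non-characters together, a validation helper for the 4-digit input might be sketched as below. The name isUsableCodePoint is hypothetical, the exclusion of 0 simply mirrors the check in the question, and whether AS3 really complains about non-characters is untested, so treat those two lines as a precaution. The non-character block is 0xFDD0-0xFDEF as defined by the Unicode standard.

```actionscript
// Hypothetical helper: true if num is a 16-bit value that is neither
// a surrogate nor a non-character.
function isUsableCodePoint(num:Number):Boolean {
    if (isNaN(num)) return false;                      // parseInt could not parse the input
    if (num <= 0 || num > 0xFFFF) return false;        // 0 excluded as in the question; above 0xFFFF needs a surrogate pair
    if (num >= 0xD800 && num <= 0xDFFF) return false;  // surrogates, invalid on their own
    if (num >= 0xFDD0 && num <= 0xFDEF) return false;  // non-character block
    if (num == 0xFFFE || num == 0xFFFF) return false;  // non-characters at the end of the plane
    return true;
}

trace(isUsableCodePoint(parseInt("dddd", 16))); // false (0xDDDD is a low surrogate)
trace(isUsableCodePoint(parseInt("0041", 16))); // true  (0x0041 is "A")

var input:Number = parseInt("ddd1", 16);
var ch:String = isUsableCodePoint(input) ? String.fromCharCode(input) : "";
```

Checking before calling String.fromCharCode means the bad string is never created, so there is nothing to catch afterwards.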

  1. AS3 actually uses UTF-16 to store its strings, but even if it didn't, the surrogate code points would still have no meaning outside pairs - the code points are reserved and can't be used in other Unicode encodings either (e.g. UTF-8 or UTF-32)