I'm working on a Twitter app and just stumbled into the world of UTF-8 (and UTF-16). It seems the majority of JavaScript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide-character aware.
I've got this function to parse strings into arrays while preserving the surrogate pairs. Then I'll recode several functions to deal with the arrays rather than strings.
function sortSurrogates(str){
    var cp = [];                     // array to hold code points
    while(str.length){               // loop till we've done the whole string
        // test whether the first code unit is a high surrogate (U+D800 - U+DBFF)
        if(/[\uD800-\uDBFF]/.test(str.substr(0,1))){
            // high surrogate found; the low surrogate follows
            cp.push(str.substr(0,2)); // push the pair onto the array
            str = str.substr(2);      // clip the pair off the string
        }else{                        // else a BMP code point
            cp.push(str.substr(0,1)); // push one code unit onto the array
            str = str.substr(1);      // clip one from the string
        }
    }                                // loop
    return cp;                       // return the array
}
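For example (𝟘 is U+1D7D8, which JavaScript stores as the surrogate pair \uD835\uDFD8):

var parts = sortSurrogates("a𝟘b"); // "a𝟘b".length is 4
alert(parts.length);               // 3 -- ["a", "𝟘", "b"]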
My question is: is there something simpler I'm missing? I see so many people reiterating that JavaScript deals with UTF-16 natively, yet my testing leads me to believe that may be the data format, but the string functions don't know it yet. Am I missing something simple?
EDIT: To help illustrate the issue:
var a = "0123456789"; // U+0030 - U+0039 2 bytes each
var b = "
JavaScript string iterators can give you the actual characters instead of the individual surrogate code units:
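For example (a sketch using ES6 for...of and the spread operator, both of which iterate strings by code point):

for (const ch of "a𝟘b") {
    console.log(ch); // "a", then "𝟘", then "b"
}
const chars = [..."a𝟘b"]; // ["a", "𝟘", "b"] -- length 3, not 4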
Javascript uses UCS-2 internally, which is not UTF-16. It is very difficult to handle Unicode in Javascript because of this, and I do not suggest attempting to do so.
As for what Twitter does, you seem to be saying that it is sanely counting by code point, not insanely by code unit.
Unless you have no choice, you should use a programming language that actually supports Unicode, and which has a code-point interface, not a code-unit interface. Javascript isn't good enough for that as you have discovered.
It has The UCS-2 Curse, which is even worse than The UTF-16 Curse, which is already bad enough. I talk about all this in my OSCON talk.
Here are a couple of scripts that might be helpful when dealing with surrogate pairs in JavaScript:

- ES6 Unicode shims for ES3+ adds the String.fromCodePoint and String.prototype.codePointAt methods from ECMAScript 6. The ES3/5 fromCharCode and charCodeAt methods do not account for surrogate pairs and therefore give wrong results (a short illustration follows the list).
- Full 21-bit Unicode code point matching in XRegExp with \u{10FFFF} allows matching any individual code point in XRegExp regexes.
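For example (a sketch using the ES6 methods directly; the shims provide the same behavior on older engines):

var s = "𝟘";                          // one code point, two UTF-16 code units
alert(s.charCodeAt(0).toString(16));  // "d835" -- just the high surrogate
alert(s.codePointAt(0).toString(16)); // "1d7d8" -- the whole code point
alert(String.fromCodePoint(0x1D7D8)); // "𝟘" -- fromCharCode truncates to 16 bits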
This is along the lines of what I was looking for. It needs better support for the different string functions; as I add to it, I will update this answer.
I've knocked together the starting point for a Unicode string handling object. It creates a function called UnicodeString() that accepts either a JavaScript string or an array of integers representing Unicode code points, and provides length and codePoints properties along with toString() and slice() methods. Adding regular expression support would be very complicated, but things like indexOf() and split() (without regex support) should be pretty easy to implement.
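Something along these lines (a minimal sketch matching the description above, not the original code; the exact internals may differ):

function UnicodeString(value){
    // accept either a JavaScript string or an array of code point integers
    var codePoints;
    if(typeof value === "string"){
        codePoints = [];
        for(var i = 0; i < value.length; i++){
            var hi = value.charCodeAt(i);
            if(hi >= 0xD800 && hi <= 0xDBFF && i + 1 < value.length){
                var lo = value.charCodeAt(i + 1);
                if(lo >= 0xDC00 && lo <= 0xDFFF){
                    // combine the surrogate pair into one code point
                    codePoints.push((hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000);
                    i++; // skip the low surrogate
                    continue;
                }
            }
            codePoints.push(hi); // BMP code point (unpaired surrogates kept as-is)
        }
    }else{
        codePoints = value.slice();
    }
    return {
        length: codePoints.length,
        codePoints: codePoints,
        toString: function(){
            var s = "";
            for(var i = 0; i < codePoints.length; i++){
                var n = codePoints[i];
                if(n > 0xFFFF){
                    // re-encode astral code points as surrogate pairs
                    n -= 0x10000;
                    s += String.fromCharCode(0xD800 + (n >> 10), 0xDC00 + (n & 0x3FF));
                }else{
                    s += String.fromCharCode(n);
                }
            }
            return s;
        },
        slice: function(begin, end){
            // slice by code point, not by code unit
            return UnicodeString(codePoints.slice(begin, end));
        }
    };
}

var u = UnicodeString("a𝟘b");
alert(u.length);                 // 3, where "a𝟘b".length is 4
alert(u.slice(1, 2).toString()); // "𝟘"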