Split JavaScript string into array of codepoints?

2019-01-18 21:18发布

问题:

Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode).

JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16) but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane).

To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively.

I'm looking for how to split a js string by codepoint, whether the codepoints require one or two JavaScript "characters" (code units).

Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.

For the purposes of this question I do not require splitting by grapheme cluster.

回答1:

@bobince's answer has (luckily) become a bit dated; you can now simply use

var chars = Array.from( text )

to obtain a list of single-codepoint strings which does respect astral / 32bit / surrogate Unicode characters.



回答2:

In ECMAScript 6 you'll be able to use a string as an iterator to get code points, or you could search a string for /./ug, or you could call getCodePointAt(i) repeatedly.

Unfortunately for..of syntax and regexp flags can't be polyfilled and calling a polyfilled getCodePoint() would be super slow (O(n²)), so we can't realistically use this approach for a while yet.

So doing it the manual way:

String.prototype.toCodePoints= function() {
    chars = [];
    for (var i= 0; i<this.length; i++) {
        var c1= this.charCodeAt(i);
        if (c1>=0xD800 && c1<0xDC00 && i+1<this.length) {
            var c2= this.charCodeAt(i+1);
            if (c2>=0xDC00 && c2<0xE000) {
                chars.push(0x10000 + ((c1-0xD800)<<10) + (c2-0xDC00));
                i++;
                continue;
            }
        }
        chars.push(c1);
    }
    return chars;
}

For the inverse to this see https://stackoverflow.com/a/3759300/18936



回答3:

Along the lines of @John Frazer's answer, one can use this even succincter form of string iteration:

const chars = [...text]

e.g., with:

const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
const chars = [...text] // ["A", "