Counting Unicode characters in JavaScript [duplica

Posted 2020-07-18 10:39

Question:

I ran into an issue with counting Unicode characters. I need to count the total number of combined Unicode characters.

Take this character for example:

द्ध

If you use the .length property on this string, it gives you 3, which is technically correct, as the string is a combination of द (U+0926), ् (U+094D) and ध (U+0927).

However, put द्ध in a text area and you realize, by using the arrow keys, that it is treated as one character. Only when you use backspace do you realize that there are 3 characters.

Edit: Also, for your test case, please consider that it could be a word. It could be something like:

द्धद्द

I would want this to count as 2, but .length gives 6.
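For reference, the counts above can be reproduced with a short snippet. The helper name `codeUnits` is only for illustration, and the strings are written with escapes so that the individual code units are visible:

```javascript
const single = "\u0926\u094d\u0927"; // द्ध
const word = single + "\u0926\u094d\u0926"; // द्धद्द

// Render each UTF-16 code unit of a string as U+XXXX.
function codeUnits(text) {
  return Array.from({ length: text.length }, (_, i) =>
    "U+" + text.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(single.length); // 3
console.log(codeUnits(single)); // [ 'U+0926', 'U+094D', 'U+0927' ]
console.log(word.length); // 6
```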

This is a problem when you want to get or set the current caret position in input elements.

Answer 1:

Your example “द्ध” is a string of three Unicode characters, and the length property correctly indicates this.

What you apparently want to count is “characters” in some other sense, something like “what a speaker of a language intuitively sees as one character”. This is a vague and mutable concept. Unicode Standard Annex #29 (UAX #29), Unicode Text Segmentation, tries to analyze the concept, calling it a “grapheme cluster”, and describes some algorithms for working with it.

Unfortunately, JavaScript has historically had no built-in tools for recognizing whether a character is, e.g., a combining mark and thus should be regarded as part of a cluster. However, if you can limit yourself to handling just one writing system, you can probably code the operations manually, referring to the relevant Unicode characters by their code numbers.
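As a rough illustration of the manual approach for Devanagari only, the sketch below treats a code point as part of the previous cluster when it is a dependent sign, or when it is a consonant that directly follows a virama (U+094D). The function and constant names are hypothetical, and the code point ranges are my simplified reading of the Devanagari block; check them against the Unicode charts before relying on them. This is not an implementation of UAX #29.

```javascript
// Hypothetical helper names; the ranges are a simplification of the
// Devanagari block (U+0900..U+097F), not a full UAX #29 implementation.
const VIRAMA = 0x094d;

// Inclusive [start, end] ranges of Devanagari dependent signs (marks).
const MARK_RANGES = [
  [0x0900, 0x0903], // candrabindu, anusvara, visarga
  [0x093a, 0x093c], // vowel signs OE/OOE and nukta
  [0x093e, 0x094f], // vowel signs and virama
  [0x0951, 0x0957], // accents and additional vowel signs
  [0x0962, 0x0963], // vowel signs vocalic L/LL
];

// Inclusive ranges of the consonants handled by this sketch.
const CONSONANT_RANGES = [
  [0x0915, 0x0939], // KA..HA
  [0x0958, 0x095f], // QA..YYA
];

function inRanges(cp, ranges) {
  return ranges.some(([start, end]) => cp >= start && cp <= end);
}

function countDevanagariClusters(text) {
  let count = 0;
  let prev = -1; // previous code point, -1 before the first one
  for (const ch of text) { // for...of iterates by code point
    const cp = ch.codePointAt(0);
    const joinsPrevious =
      prev !== -1 &&
      (inRanges(cp, MARK_RANGES) ||
        (prev === VIRAMA && inRanges(cp, CONSONANT_RANGES)));
    if (!joinsPrevious) count += 1;
    prev = cp;
  }
  return count;
}

console.log(countDevanagariClusters("\u0926\u094d\u0927")); // 1 (द्ध)
console.log(countDevanagariClusters("\u0926\u094d\u0927\u0926\u094d\u0926")); // 2 (द्धद्द)
```

Code points outside the listed ranges, such as Latin letters or digits, each count as one cluster, so mixed text still gets a number, though not necessarily the one an editor would use.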

Moreover, if the intent is to make the count match the way some input editor works (e.g. how the arrow keys move over characters), you would need to know the logic of that editor. It may implement Unicode grapheme clusters in some sense, or something else.
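Engines newer than this answer ship `Intl.Segmenter`, which segments text into grapheme clusters along the lines of UAX #29. A minimal sketch follows, assuming a runtime that provides it; `countGraphemes` is a name made up for this example. Whether द्ध is reported as one cluster or more may vary between engines and versions, and may still differ from what a given editor does, so no expected value is given for it.

```javascript
// Count grapheme clusters as the runtime's Intl.Segmenter sees them.
// `locale` is optional; undefined selects the runtime's default locale.
function countGraphemes(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) count += 1;
  return count;
}

console.log(countGraphemes("e\u0301")); // 1: "e" plus a combining acute accent
console.log(countGraphemes("\u0926\u094d\u0927", "hi")); // engine-dependent
```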