I need to take a string of mixed Asian characters (for now, assume only Chinese characters or Japanese kanji/hiragana/katakana) and "alphanumeric" text (e.g., English, French), and count it in the following way:
1) count each Asian CHARACTER as 1; 2) count each Alphanumeric WORD as 1;
a few examples:
株式会社myCompany = 4 chars + 1 word = 5 total
株式会社マイコ = 7 chars
my only idea so far is to use:
var wordArray=val.split(/\w+/);
and then check each element to see whether its contents are alphanumeric (so count it as 1) or not (so take the array length). But I don't feel that's very clever, and the text being counted might be up to 10,000 words, so it's not very quick either.
Ideas?
I think you want to loop over all characters, and increase a counter every time the current character is in a different word (according to your definition) than the previous one.
Unfortunately JavaScript's RegExp has no support for Unicode character classes; \w only applies to ASCII characters (modulo some browser bugs).
You can use Unicode characters in groups, though, so you can do it if you can isolate each set of characters you are interested in as a range, e.g.:
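As a rough sketch of that idea (not a complete solution), the pattern below lists one alternative per character set. The function name countByRanges and the exact ranges are assumptions chosen for illustration: ASCII letters and digits plus Latin-1 accented letters, hiragana, katakana, and the main CJK Unified Ideographs block.

```javascript
// Sketch only: these ranges are illustrative assumptions and far from exhaustive.
var WORD_PATTERN = new RegExp(
  [
    // A run of ASCII letters/digits or Latin-1 accented letters is one word.
    '[A-Za-z0-9\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF]+',
    // A run of hiragana is one word.
    '[\\u3041-\\u3096]+',
    // A run of katakana (including the prolonged sound mark) is one word.
    '[\\u30A1-\\u30FA\\u30FC]+',
    // Each ideograph in the main CJK block counts on its own.
    '[\\u4E00-\\u9FFF]'
  ].join('|'),
  'g'
);

// Hypothetical helper: counts the matches of WORD_PATTERN in the text.
function countByRanges(text) {
  var matches = text.match(WORD_PATTERN);
  return matches ? matches.length : 0;
}
```

Because a run of katakana is matched as a single word here, a string such as 株式会社マイコ counts as 5 rather than the 7 asked for in the question; replacing the kana alternatives with single-character classes (dropping the +) would count every kana separately.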
(This attempts to give a more realistic count of ‘words’ for Japanese, counting each run of one type of kana as a word. That's still not right, of course, but it's probably closer than treating each syllable as one word.)
Obviously there are many more characters that would have to be accounted for if you wanted to ‘do it properly’. Let's hope you don't have characters outside the Basic Multilingual Plane, for one!
You can iterate over each character in the text, examining each one to look for word breaks. The following example does this, counting each Chinese/Japanese/Korean (CJK) ideograph as a single word, and treating all alphanumeric strings as single words.
Some notes on my implementation:
- It probably doesn't handle accented characters correctly. They will probably trigger word breaks. You can modify wordBreakRegEx to fix this.
- cjkRegEx doesn't include some of the more esoteric code point ranges, since they require 5 hex digits to reference and JavaScript's regex engine doesn't seem to let you do that. But you probably don't need to worry about these, since I don't even think most fonts include them.
- I deliberately left Japanese Hiragana and Katakana out of cjkRegEx, since I'm not sure how you want to handle these. Depending on the type of text you're dealing with, it might make more sense to treat strings of them as single words. In that case, you'd need to add logic to recognize being in a "kana word" versus in an "alphanumeric word". If you don't care, then you just need to add their code point ranges to cjkRegEx. Of course, you could try to recognize word breaks within kana strings, but that quickly becomes Very Hard.

Example implementation:
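A minimal sketch of such a character-by-character counter might look like the following. The names wordBreakRegEx and cjkRegEx come from the notes above, but the patterns themselves and the function name countCjkAndWords are assumptions made for illustration. Kana are not counted at all in this version, and accented letters act as word breaks, matching the limitations described in the notes.

```javascript
// Assumed pattern: any character outside ASCII [A-Za-z0-9_] ends an alphanumeric word.
var wordBreakRegEx = /\W/;

// Assumed pattern: CJK Extension A, CJK Unified Ideographs and
// CJK Compatibility Ideographs (all inside the Basic Multilingual Plane).
var cjkRegEx = /[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]/;

// Hypothetical helper: counts each ideograph as 1 and each alphanumeric run as 1.
function countCjkAndWords(text) {
  var count = 0;
  var inWord = false; // true while we are inside an alphanumeric word

  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);

    if (cjkRegEx.test(ch)) {
      count++;        // every ideograph counts on its own
      inWord = false; // and also ends any alphanumeric word
    } else if (wordBreakRegEx.test(ch)) {
      inWord = false; // whitespace, punctuation, kana, accented letters
    } else if (!inWord) {
      count++;        // first character of a new alphanumeric word
      inWord = true;
    }
  }
  return count;
}
```

The loop makes a single pass over the string, so a text of around 10,000 words should not be a performance concern. To count kana one by one, as in the question's second example, their ranges could be added to cjkRegEx.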
The Unihan Database is very helpful for learning about CJK in Unicode. The Unicode home page also has loads of info.