I'm trying to use JavaScript's string.match() function for a fairly simple task: extract all the words from a string, then count the number of occurrences of each word. The regular expression:
/\w+/g
works fine for this task, except that it can't handle any Unicode/international characters. What's the best/cleanest way to match accented characters, the Cyrillic alphabet, and any other major alphabets?
If it happens to matter, I'm currently coding in a Node.js environment.
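To make the limitation concrete, here is a minimal sketch of what the pattern does with non-ASCII input. The sample string is made up for illustration and is written with escapes so the result does not depend on file encoding. In JavaScript, \w without any flags is equivalent to [A-Za-z0-9_].

```javascript
// Hypothetical sample text: "naïve café привет", written with escapes.
const sample = "na\u00EFve caf\u00E9 \u043F\u0440\u0438\u0432\u0435\u0442";

// \w only covers ASCII letters, digits, and "_", so accented letters
// split words apart and the Cyrillic word is skipped entirely.
const asciiWords = sample.match(/\w+/g);

console.log(asciiWords); // [ 'na', 've', 'caf' ]
```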
XRegExp:
In the comments, elclanrs suggested XRegExp. If you use it, you will also need its Unicode plugin and the Categories 1.2.0 plugin for it. Together, the package and plugins appear to be a robust and significant addition to regular expressions in JavaScript. You will then need to construct your regex from multiple character classes, because XRegExp does not define a Unicode-aware
\w
equivalent. Be sure to include at least the "Mn" category of non-spacing marks, as those are commonly used to construct non-English characters.
Stand-alone definitions of Unicode character types:
Before I knew about XRegExp, I adapted a set of Unicode character-class definitions found online for use in a project. Below is an implementation of Unicode character classes. It is significantly less capable than the full XRegExp add-in, but it would get the job done for the simple regex that you mention.
You would want:
// Unicode.w comes from the stand-alone Unicode character class definitions.
var myRegex = new RegExp("[" + Unicode.w + "]+", "g");
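For the word-counting task in the question, the same idea of combining character classes can be sketched without any library. This is not the stand-alone Unicode object or XRegExp. It swaps in native Unicode property escapes, which assumes an engine with ES2018 support, such as recent Node.js versions. The names unicodeWord and countWords are hypothetical, and lower-casing the words before counting is my own assumption about how occurrences should be grouped.

```javascript
// Assumption: the engine supports ES2018 Unicode property escapes ("u" flag).
// L = letters, Mn = non-spacing marks, Nd = decimal digits,
// Pc = connector punctuation (which includes "_").
const unicodeWord = /[\p{L}\p{Mn}\p{Nd}\p{Pc}]+/gu;

// Hypothetical helper: returns a Map from word to number of occurrences.
function countWords(text) {
  const counts = new Map();
  // match() returns null when nothing matches, so fall back to an empty list.
  for (const word of text.match(unicodeWord) || []) {
    const key = word.toLowerCase(); // assumption: counting is case-insensitive
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// "Café café привет naïve", written with escapes.
const counts = countWords(
  "Caf\u00E9 caf\u00E9 \u043F\u0440\u0438\u0432\u0435\u0442 na\u00EFve"
);
console.log(counts.get("caf\u00E9")); // 2
```

Including Mn keeps letters built from a base character plus a combining mark inside a single word, which is the reason the answer recommends that category.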