I have a JavaScript regular expression that finds two-letter words. The problem seems to be that it treats accented characters as word boundaries. Indeed, it seems that:
A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W".
Source: AS3 RegExp to match words with boundry type characters in them
And since:
\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]). \W matches any non-word characters (short for [^a-zA-Z0-9_]).
Source: http://www.javascriptkit.com/javatutors/redev2.shtml
obviously accented characters are not taken into account. This becomes a problem with words like Montréal. If the é is considered a word boundary, then al is a two-letter word. I have tried writing my own definition of a word boundary that would allow for accented characters, but since a word boundary isn't even a character, I don't know how to go about matching it.
Any help?
Here is the relevant JavaScript code, which searches userInput and finds two-letter words using the re_state regular expression:
var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
var match_state = re_state.exec(userInput);
document.getElementById("state").value = (match_state)?match_state[1]:"";
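The behaviour described above can be reproduced outside the page with a small sketch. The function names and the sample string are made up for illustration. The second function is only one possible way to define a custom boundary, and it assumes an engine that supports ES2018 Unicode property escapes (\p{L}, which needs the u flag) and lookbehind; it is not part of the original code.

```javascript
// Reproduces the problem: \b is defined in terms of \w, which only covers
// [A-Za-z0-9_], so an accented letter counts as a non-word character.
function findTwoLetterWordAscii(input) {
  const re = /\b([a-z]{2})\b/i;
  const m = re.exec(input);
  return m ? m[1] : "";
}

// Hypothetical alternative: emulate a word boundary with lookarounds that
// treat any Unicode letter as a word character.
// Assumption: the engine supports \p{L} with the u flag and lookbehind.
function findTwoLetterWordUnicode(input) {
  const re = /(?<!\p{L})(\p{L}{2})(?!\p{L})/u;
  const m = re.exec(input);
  return m ? m[1] : "";
}

// \u00e9 is the precomposed form of the accented e.
const sample = "Montr\u00e9al, QC";
console.log(findTwoLetterWordAscii(sample));   // "al"
console.log(findTwoLetterWordUnicode(sample)); // "QC"
```

The lookarounds are zero-width, just like \b, which is how a boundary can be "matched" even though it is not a character. Note that this sketch only looks at letters, so digits and underscores are treated as non-word characters here.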
While JavaScript regexes recognize non-ASCII characters in some cases (like \s), they are hopelessly inadequate when it comes to \w and \b. If you want those to work with anything beyond the ASCII word characters, you'll have to either use a different language or install Steve Levithan's XRegExp library with the Unicode plugin.

By the way, there's an error in your regex. You have a \b after the optional trailing comma, but it should be in front: \b([a-z]{2})\b,? I also removed the square brackets; you would only need those if the comma had a special meaning in regexes, which it doesn't. But I suspect you don't need to match the comma at all; \b should be sufficient to make sure you're at the end of the word. And if you don't need the comma, you don't need the capturing group either: \b[a-z]{2}\b

Have you set JavaScript to use non-ASCII? Here is a page that suggests setting JavaScript to use UTF-8: http://blogs.oracle.com/shankar/entry/how_to_handle_utf_8
It says: