In JavaScript:
"ab abc cab ab ab".replace(/\bab\b/g, "AB");
correctly gives me:
"AB abc cab AB AB"
When I use non-ASCII (Unicode) characters though:
"αβ αβγ γαβ αβ αβ".replace(/\bαβ\b/g, "AB");
the word boundary operator doesn't seem to work:
"αβ αβγ γαβ αβ αβ"
Is there a solution to this?
The word boundary assertion matches only where a word character is not preceded or followed by another word character (so .\b. is equal to \W\w and \w\W). In JavaScript, \w is defined as [A-Za-z0-9_], so \w doesn't match Greek characters, and thus you cannot use \b for this case.
What you could do instead is to use this:
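A minimal sketch of one such workaround, assuming that words are separated only by whitespace or the ends of the string. This is an illustration of the idea and not necessarily the exact snippet this answer originally contained; the variable names are made up for the example.

```javascript
// Sketch: use whitespace or the string ends as the boundary instead of \b.
var input = "αβ αβγ γαβ αβ αβ";

// (^|\s) captures the leading delimiter so it can be restored with $1.
// (?=\s|$) is a lookahead: the trailing delimiter is checked but not consumed,
// so it can still serve as the leading delimiter of the next match.
var output = input.replace(/(^|\s)αβ(?=\s|$)/g, "$1AB");

console.log(output); // "AB αβγ γαβ AB AB"
```

The lookahead matters for adjacent words such as "αβ αβ": if the trailing space were consumed, the second word would have no leading delimiter left to match.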
Not all JavaScript regexp implementations support Unicode, so you need to escape the characters.
For mapping the characters, you can take a look at http://htmlhelp.com/reference/html40/entities/symbols.html
Of course, this doesn't help with the word boundary issue (as explained in other answers), but it should at least enable you to match the characters properly.
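As a rough illustration of escaping, here is a sketch that writes the Greek letters as \uXXXX escapes (α is U+03B1, β is U+03B2, γ is U+03B3). The variable names are hypothetical, and the pattern deliberately has no boundary handling, so it also matches inside longer words.

```javascript
// \u03B1 = α, \u03B2 = β, \u03B3 = γ
var text = "\u03B1\u03B2 \u03B1\u03B2\u03B3"; // "αβ αβγ"

// The pattern is plain ASCII in the source file; the escapes denote the Greek letters.
var escapedPattern = /\u03B1\u03B2/g;

// No word boundary check here, so the "αβ" inside "αβγ" is replaced too.
var replaced = text.replace(escapedPattern, "AB");

console.log(replaced); // "AB ABγ"
```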
I needed something to be programmable and handle punctuation, brackets, etc.
http://jsfiddle.net/AQvyd/
I've written a JavaScript resource editor, which is why I found this page. I answered out of necessity, since I couldn't find a parameterized word-boundary regexp that worked well for Unicode.
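The code behind the fiddle is not reproduced here. As a rough sketch of what a parameterized word-boundary replacement could look like, the helper below takes the set of delimiter characters as an argument. The names escapeRegExp and replaceWholeWord and the default delimiter list are assumptions made for this example.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal string.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replace `word` only where it is surrounded by
// delimiter characters or by the start/end of the string.
function replaceWholeWord(text, word, replacement, delimiters) {
  // Assumed default: whitespace plus common punctuation and brackets.
  var delims = delimiters || " \t\n\r.,;:!?()[]{}\"'";
  // Escape the characters that are special inside a character class.
  var cls = "[" + delims.replace(/[\\\]\[^-]/g, "\\$&") + "]";
  var re = new RegExp("(^|" + cls + ")" + escapeRegExp(word) + "(?=" + cls + "|$)", "g");
  // A function replacement avoids special handling of "$" in `replacement`.
  return text.replace(re, function (match, lead) {
    return lead + replacement;
  });
}

console.log(replaceWholeWord("(αβ) αβγ, αβ.", "αβ", "AB")); // "(AB) αβγ, AB."
```

Passing the delimiters as a string keeps the boundary definition explicit, so it can be adjusted per language or per use case instead of relying on \b.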
When you’re dealing with Unicode and natural-language words, you probably want to be more careful with boundaries than just using \b. See this answer for details and directions.
Not all implementations of RegEx associated with JavaScript engines are Unicode-aware.
For example, Microsoft's JScript, used in IE, is limited to ANSI.
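On engines that are Unicode-aware, a boundary can be approximated with Unicode property escapes. The sketch below assumes an ES2018 or later engine (the u flag, \p{...} escapes, and lookbehind); it will throw on older engines such as the JScript mentioned above. Treating letters, numbers, and underscore as word characters is a simplification chosen for this example, not a full natural-language word segmentation.

```javascript
// Assumption: ES2018+ engine with support for the u flag, \p{...}, and lookbehind.
// A "word character" is approximated here as any Unicode letter, number, or underscore.
var wordChar = "[\\p{L}\\p{N}_]";

// Match "αβ" only when it is neither preceded nor followed by a word character.
var pattern = new RegExp("(?<!" + wordChar + ")αβ(?!" + wordChar + ")", "gu");

var result = "αβ αβγ γαβ αβ αβ".replace(pattern, "AB");

console.log(result); // "AB αβγ γαβ AB AB"
```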