How can I make a regular expression which takes ac

I have a JavaScript regular expression which basically finds two-letter words. The problem seems to be that it interprets accented characters as word boundaries. Indeed, it seems that

A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W". (Source: AS3 RegExp to match words with boundry type characters in them)

And since

\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]). \W matches any non-word characters (short for [^a-zA-Z0-9_]). (Source: http://www.javascriptkit.com/javatutors/redev2.shtml)

obviously accented characters are not taken into account. This becomes a problem with words like Montréal. If the é is treated as a non-word character, there is a word boundary right after it, and al becomes a two-letter word. I have tried making my own definition of a word boundary that would allow for accented characters, but since a word boundary isn't even a character, I don't know how to go about finding it.

Any help?

Here is the relevant JavaScript code, which searches userInput and finds two-letter words using the re_state regular expression:

var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
var match_state = re_state.exec(userInput);
document.getElementById("state").value = (match_state)?match_state[1]:"";
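The behaviour can be reproduced outside a page with a minimal sketch like the one below. The wrapper name findState is hypothetical and only stands in for the DOM assignment above; the pattern itself is the one from the question.

```javascript
// findState is a hypothetical wrapper around the original pattern,
// so it can be run without document.getElementById.
function findState(userInput) {
  var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
  var match_state = re_state.exec(userInput);
  return match_state ? match_state[1] : "";
}

console.log(findState("Sacramento, CA")); // "CA"
// \u00e9 is é; it is not in [a-zA-Z0-9_], so \b matches right after it.
console.log(findState("Montr\u00e9al")); // "al"
```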

Tags: javascript regex diacritics word-boundary

2 answers

祖国的老花朵

Reply #2 · 2019-01-26 13:01

While JavaScript regexes recognize non-ASCII characters in some cases (like \s), they're hopelessly inadequate when it comes to \w and \b. If you want those to work with anything beyond the ASCII word characters, you'll have to either use a different language or install Steve Levithan's XRegExp library with the Unicode plugin.
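On engines that implement ES2018, there is a further option that this answer does not cover: Unicode property escapes with the u flag, combined with lookarounds that stand in for \b. The sketch below assumes such an engine. The names WORD_CHAR and findTwoLetterWord are hypothetical, and the choice of letters, combining marks, digits and underscore as "word characters" is an assumption meant to mirror \w.

```javascript
// Assumption: the runtime supports the ES2018 `u` flag, \p{...} property
// escapes, and lookbehind. WORD_CHAR and findTwoLetterWord are hypothetical.
// \p{M} is included so that a decomposed é (e + U+0301) also counts as
// part of a word.
const WORD_CHAR = "[\\p{L}\\p{M}\\p{N}_]";

function findTwoLetterWord(input) {
  // Two ASCII letters that are neither preceded nor followed by a word character.
  const re = new RegExp(
    "(?<!" + WORD_CHAR + ")[a-z]{2}(?!" + WORD_CHAR + ")",
    "iu"
  );
  const match = re.exec(input);
  return match ? match[0] : "";
}

console.log(findTwoLetterWord("Montr\u00e9al"));     // ""
console.log(findTwoLetterWord("Montr\u00e9al, QC")); // "QC"
```

Even with the u flag, \w and \b keep their ASCII-based definitions, which is why the boundary is spelled out with lookarounds here.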

By the way, there's an error in your regex. You have a \b after the optional trailing comma, but it should be in front:

"\\b([a-z]{2})\\b,?"

I also removed the square brackets; you would only need those if the comma had a special meaning in regexes, which it doesn't. But I suspect you don't need to match the comma at all; \b should be sufficient to make sure you're at the end of the word. And if you don't need the comma, you don't need the capturing group either:

"\\b[a-z]{2}\\b"
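A quick check of the simplified pattern, as a sketch: it finds the same word as the original for ASCII input, but it does not by itself fix the accented-character case. The helper name firstTwoLetterWord is hypothetical, and the i flag is carried over from the question's code.

```javascript
// firstTwoLetterWord is a hypothetical helper using the simplified pattern.
function firstTwoLetterWord(input) {
  var match = /\b[a-z]{2}\b/i.exec(input);
  // No capturing group is needed: the whole match is the word.
  return match ? match[0] : "";
}

console.log(firstTwoLetterWord("Sacramento, CA, USA")); // "CA"
// Still affected by the ASCII-only \b (\u00e9 is é):
console.log(firstTwoLetterWord("Montr\u00e9al")); // "al"
```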


劫难

Reply #3 · 2019-01-26 13:05

Have you set JavaScript to use non-ASCII? Here is a page that suggests setting JavaScript to use UTF-8: http://blogs.oracle.com/shankar/entry/how_to_handle_utf_8

It says:

add a charset attribute (charset="utf-8") to your script tags in the parent page:
<script type="text/javascript" src="[path]/myscript.js" charset="utf-8"></script>
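To tell an encoding problem apart from the \w limitation, a small diagnostic like the following may help. It is only a sketch, and describeChar is a hypothetical helper name. If the character arrives as the expected code point but \w still rejects it, the script's charset is probably not the cause.

```javascript
// describeChar is a hypothetical diagnostic helper.
function describeChar(ch) {
  return {
    // Hex code point of the first character, e.g. "e9" for é.
    codePoint: ch.codePointAt(0).toString(16),
    // Whether JavaScript's ASCII-based \w accepts it.
    isWordChar: /\w/.test(ch),
  };
}

console.log(describeChar("\u00e9")); // { codePoint: 'e9', isWordChar: false }
console.log(describeChar("e"));      // { codePoint: '65', isWordChar: true }
```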
