utf-8 word boundary regex in javascript

In JavaScript:

"ab abc cab ab ab".replace(/\bab\b/g, "AB");

correctly gives me:

"AB abc cab AB AB"

When I use utf-8 characters though:

"αβ αβγ γαβ αβ αβ".replace(/\bαβ\b/g, "AB");

the word boundary operator doesn't seem to work:

"αβ αβγ γαβ αβ αβ"

Is there a solution to this?

Tags: javascript regex unicode utf-8 word-boundary

5 Answers

公子世无双

#2 · 2019-01-02 21:41

The word boundary assertion \b matches only at a position where a word character is next to a non-word character, or next to the start or end of the string (so .\b. is equivalent to \W\w or \w\W). In JavaScript, \w is defined as [A-Za-z0-9_], so it does not match Greek characters, and \b therefore cannot be used in this case.
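A quick way to check this, as a minimal sketch (the variable names are only for illustration): Greek letters count as non-word characters, so \b never finds a word/non-word transition around them.

```javascript
// \w is ASCII-only: [A-Za-z0-9_]
const asciiIsWord = /\w/.test("a");            // true
const greekIsWord = /\w/.test("α");            // false

// Spaces and Greek letters are all non-word characters here,
// so there is no position in this string where \b can match.
const boundaryInGreek = /\b/.test("αβ αβγ");   // false

// With ASCII letters, \b matches as usual.
const boundaryInAscii = /\b/.test("ab abc");   // true

console.log(asciiIsWord, greekIsWord, boundaryInGreek, boundaryInAscii);
```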

What you could do instead is to use this:

"αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB")

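As a sketch of how this behaves, and of one possible alternative: on engines that support the u flag, Unicode property escapes and lookbehind (ES2018; that support is an assumption about your runtime, not something the answer relies on), a Unicode-aware stand-in for \b can be built from negative lookarounds. Treating [\p{L}\p{N}_] as the set of "word characters" is a choice made for this example.

```javascript
const input = "αβ αβγ γαβ αβ αβ";

// The whitespace-based workaround: keep the leading delimiter via $1,
// and check the trailing one with a lookahead so it is not consumed.
const viaWhitespace = input.replace(/(^|\s)αβ(?=\s|$)/g, "$1AB");

// Assumed ES2018+ alternative: the match must not touch a letter, digit, or underscore.
const viaLookarounds = input.replace(/(?<![\p{L}\p{N}_])αβ(?![\p{L}\p{N}_])/gu, "AB");

console.log(viaWhitespace);  // "AB αβγ γαβ AB AB"
console.log(viaLookarounds); // "AB αβγ γαβ AB AB"
```

The lookaround form also handles punctuation next to the word, which the whitespace form does not.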

不流泪的眼

#3 · 2019-01-02 21:44

Not all JavaScript regexp implementations support Unicode, so you may need to escape the characters:

"αβ αβγ γαβ αβ αβ".replace(/\u03b1\u03b2/g, "AB"); // "AB ABγ γAB AB AB"

For mapping the characters you can take a look at http://htmlhelp.com/reference/html40/entities/symbols.html

Of course, this doesn't help with the word boundary issue (as explained in other answers), but it should at least let you match the characters properly.
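To generate such escapes instead of looking them up, a small helper can be sketched as follows. toUnicodeEscapes is a hypothetical name used only for this example; it works on UTF-16 code units, so characters outside the Basic Multilingual Plane come out as surrogate pairs.

```javascript
// Hypothetical helper: turn every UTF-16 code unit into a \uXXXX escape.
function toUnicodeEscapes(str) {
    let out = "";
    for (let i = 0; i < str.length; i++) {
        out += "\\u" + str.charCodeAt(i).toString(16).padStart(4, "0");
    }
    return out;
}

const escapedSource = toUnicodeEscapes("αβ");   // the text \u03b1\u03b2
const escapedRe = new RegExp(escapedSource, "g");
const escapedResult = "αβ αβγ γαβ αβ αβ".replace(escapedRe, "AB");

console.log(escapedSource);
console.log(escapedResult); // "AB ABγ γAB AB AB"
```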


浅入江南

#4 · 2019-01-02 21:44

I needed something to be programmable and handle punctuation, brackets, etc.

http://jsfiddle.net/AQvyd/

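var wordToReplace = '買い手',
    replacementWord = '[[BUYER]]',
    text = 'Mange 買い手 information. The selected Store and Classification will be the default on the สั่งซื้อ.';

function replaceWord(text, wordToReplace, replacementWord) {
    // Escape regex metacharacters so the word is matched literally.
    var escaped = wordToReplace.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // The leading delimiter is captured and put back; the trailing one is a lookahead, so it is not consumed.
    var re = new RegExp('(^|\\s|\\(|\'|"|,|;)' + escaped + '(?=$|\\s|\\)|\\.|\'|"|!|,|;|\\?)', 'gi');
    return text.replace(re, function (match, lead) {
        return lead + replacementWord;
    });
}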

I've written a JavaScript resource editor, which is how I found this page. I answered out of necessity, since I couldn't find a parameterized word-boundary regexp that worked well for Unicode.
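For comparison, here is a sketch of a parameterized variant that does not enumerate delimiters by hand. It assumes an engine with the u flag, Unicode property escapes and lookbehind (ES2018). The helper names escapeRegExp and replaceWholeWord are made up for this example, and combining marks are not treated as word characters in this sketch.

```javascript
// Hypothetical helper: escape regex metacharacters so the word is matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replace `word` only where it does not touch a letter, digit, or underscore.
function replaceWholeWord(text, word, replacement) {
    const wordChar = "[\\p{L}\\p{N}_]";
    const re = new RegExp("(?<!" + wordChar + ")" + escapeRegExp(word) + "(?!" + wordChar + ")", "gu");
    // A replacer function avoids special treatment of "$" sequences in the replacement.
    return text.replace(re, () => replacement);
}

console.log(replaceWholeWord("Mange 買い手 information.", "買い手", "[[BUYER]]"));
// "Mange [[BUYER]] information."
console.log(replaceWholeWord("(買い手), 買い手!", "買い手", "[[BUYER]]"));
// "([[BUYER]]), [[BUYER]]!"
console.log(replaceWholeWord("買い手です", "買い手", "[[BUYER]]"));
// "買い手です" (unchanged: で is a letter, so no boundary is detected)
```

The last call shows the limit of any boundary test based on neighboring characters: in scripts written without spaces, a word embedded in running text is not detected.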


泪湿衣

#5 · 2019-01-02 21:46

When you’re dealing with Unicode and natural-language words, you probably want to be more careful with boundaries than just using \b. See this answer for details and directions.
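One way to be more careful, assuming the runtime provides Intl.Segmenter (an assumption; this answer does not mention it), is to let the engine split the text into words according to Unicode segmentation rules and compare whole segments. replaceWordSegments is a hypothetical helper name for this sketch.

```javascript
// Hypothetical helper: replace segments that are word-like and equal to `word`.
function replaceWordSegments(text, word, replacement, locale) {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    let out = "";
    for (const part of segmenter.segment(text)) {
        // Non-matching segments (other words, spaces, punctuation) are copied through unchanged.
        out += (part.isWordLike && part.segment === word) ? replacement : part.segment;
    }
    return out;
}

console.log(replaceWordSegments("αβ αβγ γαβ αβ αβ", "αβ", "AB", "el"));
// "AB αβγ γαβ AB AB"
```

How text is segmented in scripts without spaces depends on the engine's locale data, so results for such text should be checked on the target runtime.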


栀子花@的思念

#6 · 2019-01-02 21:54

Not all regex implementations in JavaScript engines are Unicode-aware.

For example, Microsoft's JScript, used in IE, is limited to ANSI.
