Regex match Arabic keyword

Posted 2019-02-18 09:33

Question:

I have a simple regex that finds a word in a text:

var pattern = new RegExp("\\bsomething\\b", "gi");

This matches the word when it has spaces or punctuation around it.

So it matches:

I have something.

But it doesn't match:

I havesomething.

which is fine and exactly what I need.

But I have an issue with, for example, Arabic. If I have the regex:

var pattern = new RegExp("\\bرياضة\\b", "gi");

and text:

رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي 

The keyword which I am looking for is at the end of the text.

But this doesn't work; it simply doesn't find it.

It works if I remove \b from the regex:

var pattern = new RegExp("رياضة", "gi");

But that is not what I want, because I don't want to find it when it's part of another word, as in the English example above:

 I havesomething.

I don't know much about regex, so I'd appreciate any help making this work with both English and languages like Arabic.

Answer 1:

First, we have to understand what \b means:

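\b is an anchor that matches at a position called a "word boundary". In JavaScript, that position is defined in terms of \w, which covers only [A-Za-z0-9_], so Arabic letters never count as word characters and \b does not fire around them.

In your case, the boundary you are looking for is a position where the word has no other Arabic letter next to it.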
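
To make the behaviour of \b concrete, here is a minimal sketch that can be run in Node.js or a browser console. The variable names are only illustrative, and the Arabic sample sentence is a shortened version of the one in the question.

```javascript
// \b is built on \w, which in JavaScript covers only [A-Za-z0-9_].
const english = /\bsomething\b/gi;
const arabic = /\bرياضة\b/gi;

// English: there is a \W -> \w transition before "something", so \b matches.
const englishHit = english.test("I have something.");

// Arabic: the space and the Arabic letter are both non-\w characters,
// so there is no transition and \b never matches around the word.
const arabicHit = arabic.test("أنا أحب رياضة");

console.log(englishHit, arabicHit);
console.log(/\w/.test("ر")); // an Arabic letter is not a \w character
```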

To match only Arabic letters in a regex, we use a Unicode range:

[\u0621-\u064A]+

The pattern above matches one or more Arabic letters. To make a word boundary out of it, we can simply negate the class on both sides:

[^\u0621-\u064A]ARABIC TEXT[^\u0621-\u064A]

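The pattern above means: require a character that is not an Arabic letter on each side of the Arabic word, which will work in your case. Note that it needs an actual character on each side, so a word at the very start or end of the string only matches if the text is padded (for example with spaces).

Consider this example, based on the one you gave, which I modified a little: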
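
As a variation on the negated class, the same idea can be sketched with lookarounds, which check the neighbouring characters without consuming them. This is only a sketch: `arabicWordRegex` is a hypothetical helper name, it assumes the keyword contains plain letters with no regex metacharacters, and it needs an engine with ES2018 lookbehind support.

```javascript
// Hypothetical helper: match `word` only when it is not flanked by Arabic letters.
function arabicWordRegex(word) {
  const letter = "[\\u0621-\\u064A]";
  // (?<!...) and (?!...) assert on the neighbours without consuming them,
  // so a word at the start or end of the string still matches.
  return new RegExp("(?<!" + letter + ")" + word + "(?!" + letter + ")", "g");
}

const text = "رياض أنا أحب رياضتي رياض رياضة رياضيات";
const matches = text.match(arabicWordRegex("رياض")) || [];
console.log(matches.length);
```

Because nothing around the word is consumed, the match contains only the keyword itself, which is convenient when wrapping it in markup.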

 أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا 

Imagine we are trying to match only رياض. A plain search for this word would also match inside رياضة, رياضيات, and رياضتي. However, if we add the pattern above, only the standalone رياض is matched.

var x = " أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا ";
// $1 holds the word plus the non-Arabic character on each side (here, the spaces)
x = x.replace(/([^\u0621-\u064A]رياض[^\u0621-\u064A])/g, '<span style="color:red">$1</span>');
document.write(x);

If you would like to account for the alef variants آ, أ, إ, and ا with one pattern, you could use a character class like [\u0622\u0623\u0625\u0627]. Here is a complete example:

var x = "أنا هنا وانا هناك .. آنا هنا وإنا هناك";
x = x.replace(/([\u0622\u0623\u0625\u0627]نا)/g, '<span style="color:red">$1</span>');
document.write (x);
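
The alef class can also be applied to an arbitrary keyword rather than written by hand. The following is a rough sketch under that assumption: `alefInsensitive` and `ALEF_CLASS` are hypothetical names, and the keyword is assumed to contain no regex metacharacters.

```javascript
// Character class covering the four alef forms used above.
const ALEF_CLASS = "[\\u0622\\u0623\\u0625\\u0627]";

// Hypothetical helper: replace every alef variant in the keyword with the
// class, so the resulting regex accepts any of the four forms in that position.
function alefInsensitive(keyword) {
  const source = keyword.replace(/[\u0622\u0623\u0625\u0627]/g, ALEF_CLASS);
  return new RegExp(source, "g");
}

const sample = "أنا هنا وانا هناك .. آنا هنا وإنا هناك";
const found = sample.match(alefInsensitive("انا")) || [];
console.log(found);
```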



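Answer 2:

This doesn't work because \b in JavaScript regex is defined in terms of \w, which covers only ASCII letters, digits, and the underscore, so Arabic letters are not treated as word characters. Instead, you could search for the Unicode characters in the text (Unicode ranges).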

Or you could convert the text to Unicode escapes and then build the regex from those (I have never tried this, but it should work).
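
One way to sketch the Unicode-range idea so that it covers English and Arabic with the same code is to use Unicode property escapes instead of a hand-written range. This is an assumption-laden sketch, not something the answers above tested: `wholeWordRegex` is a hypothetical helper, and it requires an engine that supports the `u` flag with `\p{...}` and lookbehind (both ES2018).

```javascript
// Hypothetical helper: match `word` only when it is not adjacent to
// another letter, combining mark, digit, or underscore in any script.
function wholeWordRegex(word) {
  // Escape regex metacharacters so the keyword is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  return new RegExp("(?<!" + wordChar + ")" + escaped + "(?!" + wordChar + ")", "giu");
}

console.log(wholeWordRegex("something").test("I have something."));
console.log(wholeWordRegex("something").test("I havesomething."));
console.log(wholeWordRegex("رياضة").test("رياضة أنا أحب رياضتي"));
console.log(wholeWordRegex("رياضة").test("أنا أحب الرياضة"));
```

A fresh regex is created for each call here because a regex with the `g` flag keeps its `lastIndex` between `test` calls.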