Can you help me find a regex that takes a list of phrases and checks whether one of these phrases exists in the given text, please?
Example:
If I have the following phrases in the HashSet:
كيف الحال
إلى أين
أين يوجد
هل من أحد هنا
And the given text is: كيف الحال أتمنى أن تكون بخير
After applying the regex, I want to get: كيف الحال
My initial code:
HashSet<String> QWWords = new HashSet<String>();
QWWords.add("كيف الحال");
QWWords.add("إلى أين");
QWWords.add("أين يوجد");
QWWords.add("هل من أحد هنا");
String s1 = "كيف الحال أتمنى أن تكون بخير";
for (String qp : QWWords) {
    Pattern p = Pattern.compile("[\\s" + qp + "\\s]");
    Matcher m = p.matcher(s1);
    String found = "";
    while (m.find()) {
        found = m.group();
        System.out.println(found);
    }
}
[...] is a character class, and a character class matches exactly one character from the set it specifies. For instance, the character class [abc] can match only a, b, or c. So if you want to find the whole word abc, don't surround it with [...].
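To make the difference concrete, here is a minimal sketch. The class name CharClassDemo and the helper findAll are made up for this illustration and are not part of the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // [abc] matches one character at a time: a, b, or c.
        System.out.println(findAll("[abc]", "abc cab")); // [a, b, c, c, a, b]
        // abc without brackets matches the whole three-letter sequence.
        System.out.println(findAll("abc", "abc cab"));   // [abc]
    }
}
```

The same thing happens with "[\\s" + qp + "\\s]": the brackets turn the whole phrase into a set of single characters, so each match is one character long.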
Another problem is that you are using \\s as the word separator, so in the following string
String data = "foo foo foo foo";
the regex \\sfoo\\s will not be able to match the first foo because there is no whitespace before it.
So the first match it finds will be
String data = "foo foo foo foo";
// this one ------^^^^^
Now, since the regex consumed the space after the second foo, it can't reuse that space in the next match, so the third foo is also skipped: there is no space left to match before it. The fourth foo will not match either, because this time there is no space after it.
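If you want to check this yourself, a small sketch like the following should do. The class name WhitespaceSeparatorDemo and the helper matchStarts are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceSeparatorDemo {
    // Hypothetical helper: returns the start index of every match of the regex.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String data = "foo foo foo foo";
        // The only match starts at index 3: the space before the second foo.
        System.out.println(matchStarts("\\sfoo\\s", data)); // [3]
    }
}
```

Out of four occurrences of foo, only one is found, and the match also contains the surrounding spaces.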
To solve this problem you can use \\b, a word boundary, which checks whether the position it represents lies between an alphanumeric and a non-alphanumeric character (or at the start/end of the string). It matches a position only, so it does not consume any characters.
So instead of
Pattern p = Pattern.compile("[\\s" + qp + "\\s]");
use
Pattern p = Pattern.compile("\\b" + qp + "\\b");
or, maybe better, as Tim mentioned,
Pattern p = Pattern.compile("\\b" + qp + "\\b",Pattern.UNICODE_CHARACTER_CLASS);
to make sure that \\b treats Arabic characters as part of the predefined alphanumeric class.
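Putting it together, a rough sketch of the whole loop could look like this. The class name PhraseFinder and the method findPhrases are my own names, a List is used instead of the HashSet only to keep the output order predictable, and the code assumes the phrases contain no regex metacharacters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PhraseFinder {
    // Hypothetical helper: returns the phrases that occur in the text as whole words.
    static List<String> findPhrases(List<String> phrases, String text) {
        List<String> found = new ArrayList<>();
        for (String phrase : phrases) {
            // Assumption: the phrase contains no regex metacharacters.
            Pattern p = Pattern.compile("\\b" + phrase + "\\b",
                    Pattern.UNICODE_CHARACTER_CLASS);
            if (p.matcher(text).find()) {
                found.add(phrase);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList(
                "كيف الحال", "إلى أين", "أين يوجد", "هل من أحد هنا");
        String text = "كيف الحال أتمنى أن تكون بخير";
        System.out.println(findPhrases(phrases, text));
    }
}
```

With the data from the question this should report only كيف الحال, since none of the other phrases appears in the text.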
UPDATE:
I am not sure whether your words can contain regex metacharacters like {, [, +, *, and so on, so just in case you can also add an escaping mechanism that turns such characters into literals.
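To see why escaping matters, here is a small sketch that searches for the phrase a.b with and without escaping, using Pattern.quote from java.util.regex as the escaping mechanism. The class name QuoteDemo, the method findPhrase, and the sample text are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: finds whole-word occurrences of the phrase,
    // optionally escaping it with Pattern.quote first.
    static List<String> findPhrase(String phrase, String text, boolean escape) {
        String body = escape ? Pattern.quote(phrase) : phrase;
        Pattern p = Pattern.compile("\\b" + body + "\\b",
                Pattern.UNICODE_CHARACTER_CLASS);
        List<String> matches = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "axb a.b";
        // Unescaped, the dot matches any character, so axb is found too.
        System.out.println(findPhrase("a.b", text, false)); // [axb, a.b]
        // Escaped, the dot is a literal dot.
        System.out.println(findPhrase("a.b", text, true));  // [a.b]
    }
}
```

One caveat: \\b only matches next to a word character, so a phrase that starts or ends with a symbol (for example one ending in +) may still not match even after escaping.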
So
"\\b" + qp + "\\b"
can become
"\\b" + Pattern.quote(qp) + "\\b"