Extracting Arabic words (not semantic Arabic phrase)

Posted 2019-06-09 19:58

Question:

This question already has an answer here:

  • Extract Arabic phrases from a given text in Java
String description="Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. البيانات الضخمة هي عبارة عن مجموعة من مجموعة البيانات الضخمة جداً والمعقدة لدرجة أنه يُصبح من الصعب معالجتها باستخدام أداة واحدة فقط من أدوات إدارة قواعد البيانات أو باستخدام تطبيقات معالجة البيانات التقليدية. ";

I need a regex that extracts only the Arabic words.

I checked this ticket; however, it is a PHP question, while I need a Java regex.

import java.util.regex.*;
Pattern p = Pattern.compile("#(?:[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*)#u");
print(p.matcher(description).group(1));

It raises an error.

Answer 1:

To find one or more Arabic characters you can use \p{InArabic}+

This class is not mentioned directly in the Pattern documentation, but the documentation does give us information about

Classes for Unicode scripts, blocks, categories and binary properties
\p{IsLatin} A Latin script character (script)
\p{InGreek} A character in the Greek block (block)
\p{Lu} An uppercase letter (category)
\p{IsAlphabetic} An alphabetic character (binary property)

and, encouraged by the \p{InGreek} example, we can read further about blocks to find that

Blocks are specified with the prefix In, as in InMongolian, or by using the keyword block (or its short form blk) as in block=Mongolian or blk=Mongolian.

The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.

That last sentence is the most important one for us. Now we need to check whether UnicodeBlock supports a block of Arabic characters. Its documentation lists the field

public static final Character.UnicodeBlock ARABIC

which suggests that the Arabic character block is supported.
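To double-check that reasoning, here is a small sketch. The class name ArabicBlockCheck and its helper methods are made up for illustration; only Character.UnicodeBlock and Pattern come from the JDK. It confirms that the block name resolves to ARABIC and that the In prefix and the block=/blk= keywords quoted above are accepted for it.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class ArabicBlockCheck {
    // True if the code point belongs to the Unicode block named "Arabic".
    static boolean inArabicBlock(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.ARABIC;
    }

    // True if the whole string matches under each spelling of the block class.
    static boolean matchesAllForms(String s) {
        String[] forms = {"\\p{InArabic}+", "\\p{block=Arabic}+", "\\p{blk=Arabic}+"};
        for (String form : forms) {
            if (!Pattern.matches(form, s)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The block name that Pattern accepts is resolved by UnicodeBlock.forName.
        System.out.println(Character.UnicodeBlock.forName("Arabic") == Character.UnicodeBlock.ARABIC);
        System.out.println(inArabicBlock(0x0628)); // U+0628, ARABIC LETTER BEH
        System.out.println(inArabicBlock('a'));    // a Basic Latin letter
        // A two-letter Arabic word written with Unicode escapes.
        System.out.println(matchesAllForms("\u0647\u064A"));
    }
}
```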


So to find single Arabic words, your code can look like this:

String description="Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. البيانات الضخمة هي عبارة عن مجموعة من مجموعة البيانات الضخمة جداً والمعقدة لدرجة أنه يُصبح من الصعب معالجتها باستخدام أداة واحدة فقط من أدوات إدارة قواعد البيانات أو باستخدام تطبيقات معالجة البيانات التقليدية. ";
Pattern p = Pattern.compile("\\p{InArabic}+");
Matcher m = p.matcher(description);
while(m.find()){
    System.out.println(m.group());
}

output:

البيانات
الضخمة
هي
.
.
.
البيانات
التقليدية

If you want to find groups of Arabic words separated by one or more whitespace characters, you can use this pattern:

Pattern p = Pattern.compile("\\p{InArabic}+(?:\\s+\\p{InArabic}+)*");

Recall that * means zero or more repetitions, and + means one or more.

So this regex means:

\\p{InArabic}+     # one or more Arabic characters (Arabic word)
(?:                # non-capturing group storing:
  \\s+             # one or more whitespace characters
  \\p{InArabic}+   # with another Arabic word after it
)*                 # zero or more times
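Putting the pieces together, a self-contained version could look roughly like the sketch below. The class name ArabicPhrases, the extract method, and the short sample sentence are assumptions made for illustration; the pattern itself is the one explained above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class ArabicPhrases {
    // An Arabic word, optionally followed by more whitespace-separated Arabic words.
    private static final Pattern PHRASE =
            Pattern.compile("\\p{InArabic}+(?:\\s+\\p{InArabic}+)*");

    // Collects every maximal run of Arabic words found in the text.
    static List<String> extract(String text) {
        List<String> phrases = new ArrayList<>();
        Matcher m = PHRASE.matcher(text);
        while (m.find()) {
            phrases.add(m.group());
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Sample input: Latin text, two Arabic words, Latin text, one Arabic word.
        // The Arabic words are written as Unicode escapes to avoid source-encoding issues.
        String text = "Big data: "
                + "\u0627\u0644\u0628\u064A\u0627\u0646\u0627\u062A \u0627\u0644\u0636\u062E\u0645\u0629"
                + " is large, \u0647\u064A too.";
        for (String phrase : extract(text)) {
            System.out.println(phrase);
        }
    }
}
```

With this sample, the first two Arabic words come back as one phrase and the third as a separate one, because Latin text sits between them. Keep in mind that the Arabic block also contains punctuation such as the Arabic comma (U+060C) and the Arabic-Indic digits, so \p{InArabic} matches those as well.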