Regex to match sentences with at least n words

Posted 2019-07-22 09:18

I'm trying to pull all sentences from a text that consist of, say, at least 5 words in PHP. Assuming sentences end with a full stop, question mark, or exclamation mark, I came up with this:

 /[\w]{5,*}[\.|\?|\!]/ 

Any ideas what's wrong?

Also, what needs to be done for this to work with UTF-8?

Tags: php regex utf-8
3 answers
Bombasti
#2 · 2019-07-22 09:24

A method without regex:

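$str = "this is a more than five word sentence. But this is not. Neither this. NO";

// Split on full stops only; "?" and "!" are not handled by this approach.
$sentences = explode(".", $str);
foreach($sentences as $s)
{
   $words = explode(' ', $s);
   // ">= 5" implements "at least 5 words"; the filter drops empty strings left by extra spaces.
   if(count(array_filter($words, 'is_notempty')) >= 5)
       echo "Found matching sentence : $s" . "<br/>";
}

function is_notempty($x)
{
 // Compare with '' rather than using empty(), which would also discard the word "0".
 return $x !== '';
}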

This outputs:

Found matching sentence : this is a more than five word sentence

Emotional °昔
#3 · 2019-07-22 09:28

\w only matches a single word character. A single word would be \w+. If you need at least 5 words, you could do something like:

/(\w+\s){4,}\w+[.?!]/

i.e. at least 4 words, each followed by a whitespace character, then another word followed by a sentence delimiter.
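
As a rough sketch of how this pattern might be used, the snippet below wraps it in a small helper around preg_match_all. The function name find_long_sentences, its $minWords parameter, and the sample text are all made up for illustration and are not part of the original answer; $minWords is assumed to be at least 1.

```php
<?php
// Hypothetical helper (name and signature are illustrative, not from the thread).
function find_long_sentences(string $text, int $minWords = 5): array
{
    // ($minWords - 1) or more "word + whitespace" groups,
    // then a final word and a sentence terminator.
    $pattern = '/(\w+\s){' . ($minWords - 1) . ',}\w+[.?!]/';
    preg_match_all($pattern, $text, $matches);
    return $matches[0]; // index 0 holds the full matches
}

// Made-up sample text.
$text = "Short one here. This sentence has more than five words! "
      . "Too short? Here is another sufficiently long sentence.";

foreach (find_long_sentences($text) as $sentence) {
    echo $sentence, PHP_EOL;
}
```

Keep in mind that the pattern only allows a single whitespace character between words, so a comma or other punctuation inside a sentence breaks the run of words, and a match may then start in the middle of a sentence.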

欢心
#4 · 2019-07-22 09:47

I agree with the solution posted above. If you're using the preg functions in PHP, you can add the 'u' pattern modifier so that the pattern and the subject are treated as UTF-8, for example: /(\w+\s){4,}\w+[.?!]/u
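
A small sketch of the UTF-8 case follows. Whether \w matches non-ASCII letters under the 'u' modifier can depend on the PHP and PCRE build, so this variant spells the word class out with Unicode properties instead; that substitution, the sample text, and the variable names are assumptions for illustration, not something stated in the answers.

```php
<?php
// Made-up sample text with non-ASCII letters (source file assumed to be UTF-8).
$text = "Çà et là. Où est passée la clé dorée? Très bien.";

// Letters, combining marks, digits and underscore, written explicitly
// so the result does not rely on how \w behaves in UTF-8 mode.
$word = '[\p{L}\p{M}\p{N}_]+';

// Same shape as the ASCII pattern: 4+ "word + whitespace" groups,
// a final word, then a terminator. The 'u' modifier enables UTF-8 mode.
$pattern = '/(' . $word . '\s){4,}' . $word . '[.?!]/u';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

Only the second sentence has five or more words, so it should be the only entry in $matches[0].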
