Regex to match sentences with at least n words

Posted 2019-07-22 09:07

Question:

I'm trying to use PHP to pull all sentences from a text that consist of at least, say, 5 words. Assuming sentences end with a full stop, question mark, or exclamation mark, I came up with this:

 /[\w]{5,*}[\.|\?|\!]/ 

Any ideas what's wrong?

Also, what needs to be done for this to work with UTF-8?

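Answer 1:

\w only matches a single character, so [\w]{5,} would match five or more word characters, not five words. Note also that {5,*} is not a valid quantifier (an open-ended minimum is written {5,}), and that inside a character class | is a literal pipe while . ? and ! need no escaping. A single word would be \w+. If you need at least 5 words, you could do something like: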

/(\w+\s){4,}\w+[.?!]/

i.e. at least 4 words, each followed by a single whitespace character, then another word followed by a sentence delimiter.
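As a rough sketch, the pattern above could be applied with preg_match_all like this. The helper name findLongSentences and the sample text are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper: returns every match of the "at least 5 words" pattern.
function findLongSentences(string $text): array
{
    // Four or more "word + one whitespace" groups, then a final word and a delimiter.
    preg_match_all('/(\w+\s){4,}\w+[.?!]/', $text, $matches);
    return $matches[0]; // index 0 holds the full matches
}

$text = "This is a more than five word sentence. But this is not. Is this one long enough now? No!";
print_r(findLongSentences($text));
```

With this sample text the helper should return the first and third sentences. Keep in mind that the pattern expects each word to be followed directly by one whitespace character, so a sentence containing commas or double spaces may be matched only partially or not at all.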



Answer 2:

I agree with the solution posted here. If you're using the preg functions in PHP, you can add the 'u' pattern modifier to make this work with UTF-8, for example: /(\w+\s){4,}\w+[.?!]/u
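A minimal sketch of the 'u' modifier in use, assuming the source file and the input are both UTF-8 encoded and a reasonably recent PHP/PCRE build in which 'u' also makes \w Unicode-aware. The helper name findLongSentencesUtf8 and the German sample text are illustrative only.

```php
<?php
// Hypothetical helper: same pattern as above, with the 'u' modifier added.
function findLongSentencesUtf8(string $text): array
{
    // preg_match_all returns false on failure, e.g. if $text is not valid UTF-8.
    if (preg_match_all('/(\w+\s){4,}\w+[.?!]/u', $text, $matches) === false) {
        return [];
    }
    return $matches[0];
}

$text = "Das Mädchen läuft über die Straße. Zu kurz.";
print_r(findLongSentencesUtf8($text));
```

Without the 'u' modifier the subject is treated as a byte string, so accented letters such as "ä" or "ß" are typically not matched by \w and the first sentence would be missed.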



Answer 3:

A method without regex:

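$str = "this is a more than five word sentence. But this is not. Neither this. NO";

// Split on full stops only; "?" and "!" are not handled here.
$sentences = explode(".", $str);
foreach ($sentences as $s)
{
   // Leading or repeated spaces produce empty strings, so filter them out before counting.
   $words = explode(' ', $s);
   if (count(array_filter($words, 'is_notempty')) >= 5)   // at least 5 words
       echo "Found matching sentence : $s" . "<br/>";
}

function is_notempty($x)
{
 // Compare against '' instead of using empty(), which would also drop the word "0".
 return $x !== '';
}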

This outputs:

Found matching sentence : this is a more than five word sentence
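The question also treats "?" and "!" as sentence endings, which a single explode on "." does not cover. One possible regex-free variation, sketched here under the assumption that words are separated by plain spaces, is to normalise the other delimiters first. The function name longSentencesWithoutRegex and its $minWords parameter are hypothetical.

```php
<?php
// Hypothetical helper: regex-free filter for sentences with at least $minWords words.
function longSentencesWithoutRegex(string $text, int $minWords = 5): array
{
    // Turn "?" and "!" into "." so one explode() call splits on all three delimiters.
    $parts = explode('.', strtr($text, '?!', '..'));
    $result = [];
    foreach ($parts as $part) {
        // Drop the empty strings produced by leading or repeated spaces.
        $words = array_filter(explode(' ', $part), function ($w) {
            return $w !== '';
        });
        if (count($words) >= $minWords) {
            $result[] = trim($part);
        }
    }
    return $result;
}

$str = "this is a more than five word sentence. But this is not? Is this one long enough now! NO";
print_r(longSentencesWithoutRegex($str));
```

Note that the original delimiter is lost in this approach, so the returned sentences come back without their trailing punctuation.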



Tags: php regex utf-8