In PHP, I'm trying to pull all sentences from a text that consist of, say, at least 5 words. Assuming sentences end with a full stop, question mark, or exclamation mark, I came up with this:
/[\w]{5,*}[\.|\?|\!]/
Any ideas what's wrong?
Also, what needs to be done for this to work with UTF-8?
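\w only matches a single character, so [\w]{5,} would match five word characters, not five words. A single word would be \w+. Note also that {5,*} is not a valid quantifier ("at least 5" is written {5,}), and inside a character class | is a literal pipe rather than alternation, so [.?!] is enough. If you need at least 5 words, you could do something like: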
/(\w+\s){4,}\w+[.?!]/
i.e. at least 4 words, each followed by a single whitespace character, then another word followed by a sentence delimiter.
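To see the pattern in action, here is a minimal sketch using preg_match_all. The sample text and variable names are made up for illustration; only the pattern comes from the answer above.

```php
<?php
// Hypothetical sample text, chosen so that two of the three sentences qualify.
$text = "This is a sentence with seven words. Too short. Is this one long enough to match?";

// preg_match_all collects every non-overlapping match;
// $matches[0] holds the full matched strings.
preg_match_all('/(\w+\s){4,}\w+[.?!]/', $text, $matches);

foreach ($matches[0] as $sentence) {
    echo $sentence, PHP_EOL;
}
```

One caveat: the pattern matches an unbroken run of words, so a comma or other punctuation inside a sentence interrupts the run, and the match may start in the middle of the sentence.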
I agree with the solution posted here. If you're using the preg functions in PHP, you can add the 'u' pattern modifier for this to work with UTF-8, for example:
/(\w+\s){4,}\w+[.?!]/u
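A small sketch of the 'u' modifier on non-ASCII text follows. The sample string is hypothetical, and it assumes the source file is saved as UTF-8 and that PHP's PCRE build has Unicode support, which is what lets \w match accented letters.

```php
<?php
// Hypothetical UTF-8 sample text (this file must itself be saved as UTF-8).
$text = "Ça va être une très belle journée. Très bien. Über den Wolken ist die Freiheit grenzenlos.";

// The trailing 'u' makes PCRE treat pattern and subject as UTF-8,
// so multi-byte letters such as "Ç" or "Ü" count as word characters.
preg_match_all('/(\w+\s){4,}\w+[.?!]/u', $text, $matches);

foreach ($matches[0] as $sentence) {
    echo $sentence, PHP_EOL;
}
```

Without the modifier, the multi-byte letters would typically not be matched by \w, so words containing them would break the run.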
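A method without regex:
$str = "this is a more than five word sentence. But this is not. Neither this. NO";

// Treat '?' and '!' as sentence ends too, then split on the full stop.
$sentences = explode(".", str_replace(array("?", "!"), ".", $str));

foreach ($sentences as $s)
{
    // Split on spaces and drop the empty strings left by leading or repeated spaces.
    $words = array_filter(explode(' ', $s), 'is_notempty');

    // "At least 5 words" means >= 5, not > 5.
    if (count($words) >= 5)
        echo "Found matching sentence : $s" . "<br/>";
}

// Compare against '' rather than using empty(), which would also discard the word "0".
function is_notempty($x)
{
    return $x !== '';
}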
This outputs:
Found matching sentence : this is a more than five word sentence
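The two approaches can also be combined: split the text into sentences first, then count the words in each one. The sketch below does that, so commas inside a sentence no longer matter. The function name sentencesWithMinWords and the sample text are hypothetical and not taken from the answers above.

```php
<?php
// Hypothetical helper: returns the sentences of $text that contain at least $minWords words.
function sentencesWithMinWords(string $text, int $minWords = 5): array
{
    // Split on whitespace that follows '.', '?' or '!'.
    // The lookbehind keeps the delimiter attached to its sentence.
    $sentences = preg_split('/(?<=[.?!])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $result = [];
    foreach ($sentences as $sentence) {
        // preg_match_all returns the number of matches, here runs of word characters.
        $wordCount = preg_match_all('/\w+/u', $sentence);
        if ($wordCount >= $minWords) {
            $result[] = trim($sentence);
        }
    }
    return $result;
}

$text = "Well, this one has a comma in it. Too short! Does this longer question also count?";
print_r(sentencesWithMinWords($text));
```

Because \w+ stops at apostrophes and hyphens, a word like "don't" is counted as two words here. Adjust the word pattern if that matters for your text.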