With format 1, the str_word_count() function returns an array that holds all words in a string. It works great, except when the string contains special characters. In this case, the PHP script receives the string via the query string:
When I open: http://localhost/index.php?q=this%20wórds
header('Content-Type: text/html; charset=utf-8');
print_r(str_word_count($_GET['q'],1,'ó'));
Instead of returning:
[0] this
[1] wórds
...it returns:
[0] this
[1] w
[2] rds
How can this function support special characters that are sent through the query string?
Update - it worked out just fine by using mario's solution:
function sanitize_words($string) {
    preg_match_all("/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u", $string, $matches, PREG_PATTERN_ORDER);
    return $matches[0];
}
Not sure if that third parameter is sufficient to make str_word_count work for non-ASCII symbols. It probably only works with Latin-1, if anything.
As an alternative, you could count the words with a regex, however:
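A minimal sketch of such a regex-based count, assuming the input is valid UTF-8 and treating each run of Unicode letters as one word; the helper name count_words_utf8 is hypothetical and not taken from the original answer:

```php
<?php
// Hypothetical helper: counts runs of Unicode letters in a UTF-8 string.
function count_words_utf8(string $string): int
{
    // \pL matches any Unicode letter; the /u modifier makes PCRE treat
    // both the pattern and the subject as UTF-8.
    $count = preg_match_all('/\pL+/u', $string, $matches);

    // preg_match_all() returns false on failure, e.g. for invalid UTF-8.
    return $count === false ? 0 : $count;
}

echo count_words_utf8('this wórds'), PHP_EOL; // prints 2
```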
This works for UTF-8 at least. To fully replicate str_word_count you might eventually need [\pL']+.
All possible combinations:
What about just
You can also split the string with explode(' ', $string) and then count the resulting array with count($array).
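A rough sketch of that explode-based approach, assuming words are separated only by plain spaces. The helper name count_space_separated is hypothetical, the arrow function requires PHP 7.4 or later, and the filtering step is an addition that discards the empty strings explode() yields for repeated or trailing spaces:

```php
<?php
// Hypothetical helper: counts space-separated chunks of a string.
function count_space_separated(string $string): int
{
    // explode() splits on the single-byte space, so multibyte UTF-8
    // characters inside the words are left untouched.
    $parts = explode(' ', $string);

    // Drop empty strings produced by leading, trailing, or repeated spaces.
    $parts = array_filter($parts, static fn(string $p): bool => $p !== '');

    return count($parts);
}

echo count_space_separated('this  wórds '), PHP_EOL; // prints 2
```

Unlike the regex approach, punctuation stays attached to the neighbouring word here, which does not change the count but does change the words themselves.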
For German, use this:
For all other languages (French, Polish, etc.), just replace the special characters with your own.