I've been trying to work out how to build an effective word counter for a string. I know about PHP's built-in str_word_count, but unfortunately it doesn't do what I need: I have to count words in text that includes English, Chinese, Japanese and accented characters.
str_word_count fails to count such words unless you add the characters to the third argument, but that is impractical: it could mean listing every single Chinese, Japanese and accented character, which is not what I need.
Tests:
str_word_count('The best tool'); // int(3)
str_word_count('最適なツール'); // int(0)
str_word_count('最適なツール', 0, '最ル'); // int(5)
Anyway, I found this function online. It looked like it could do the job, but sadly it fails to count text that is written without spaces:
function word_count($str)
{
    if ($str === '')
    {
        return 0;
    }
    return preg_match_all("/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u", $str);
}
Tests:
word_count('The best tool'); // int(3)
word_count('最適なツール'); // int(1)
// With spaces
word_count('最 適 な ツ ー ル'); // int(5)
Basically, I'm looking for a good UTF-8-aware word counter that can count words in any typical language, including accented characters and non-Latin scripts. Is there a possible solution to this?
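To see why the regex-based word_count() reports a single word for unspaced Japanese, it can help to print what the pattern actually matches. The following is only a sketch: the helper name word_matches is hypothetical and not part of the function found online. Because every kanji and kana character falls under \p{L}, the whole unspaced string is consumed as one run of letters.

```php
<?php
// Hypothetical helper: returns the matched "words" instead of only counting them.
function word_matches(string $str): array
{
    // Same pattern as word_count(): a letter followed by any number of
    // letters, non-spacing marks, dashes or apostrophes.
    preg_match_all("/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u", $str, $m);
    return $m[0];
}

print_r(word_matches('The best tool'));
print_r(word_matches('最適なツール'));
```

The second call returns the entire string as a single match, which is why the count comes out as 1.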
You can take a look at the mbstring extension to work with UTF-8 strings.
mb_split() splits a multibyte string using a regex pattern.
<?php
printf("Counting words in: %s\n", $argv[1]);
mb_regex_encoding('UTF-8');
mb_internal_encoding("UTF-8");
$r = mb_split(' ', $argv[1]);
print_r($r);
printf("Word count: %d\n", count($r));
$ php mb.php "foo bar"
Counting words in: foo bar
Array
(
[0] => foo
[1] => bar
)
Word count: 2
$ php mb.php "最適な ツール"
Counting words in: 最適な ツール
Array
(
[0] => 最適な
[1] => ツール
)
Word count: 2
Note: I originally had to add 2 spaces between characters to get a correct count. This was fixed by setting mb_regex_encoding() and mb_internal_encoding() to UTF-8.
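Splitting on a single literal space produces empty pieces when the input contains consecutive spaces, and it does not treat the ideographic space (U+3000) as a separator. A possible variation, sketched here with preg_split() rather than mb_split(), splits on any run of whitespace and drops empty pieces. The helper name count_space_separated is hypothetical.

```php
<?php
// Hypothetical helper: counts whitespace-separated tokens in a UTF-8 string.
function count_space_separated(string $str): int
{
    // Split on runs of whitespace or Unicode separators (\p{Z} covers
    // U+3000, the ideographic space). PREG_SPLIT_NO_EMPTY drops empty pieces.
    $parts = preg_split('/[\s\p{Z}]+/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? 0 : count($parts);
}

printf("%d\n", count_space_separated('foo  bar'));
printf("%d\n", count_space_separated('最適な ツール'));
```

This still only counts space-separated chunks, so it does not solve the problem for text written without spaces.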
However, Chinese is written without spaces between "words" (and the same is true of Japanese in many cases), so you may never get a meaningful result this way.
You may need to write an algorithm that uses a dictionary to determine which groups of characters form a "word".
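As a rough illustration of the dictionary idea, here is a minimal greedy longest-match sketch. Everything in it is an assumption made for the example: the helper name segment_with_dictionary, the $maxLen limit, and the two-entry toy dictionary. A real segmenter would need a large dictionary and would have to handle whitespace, punctuation and ambiguous splits.

```php
<?php
// Hypothetical helper: greedy longest-match segmentation against a word list.
function segment_with_dictionary(string $text, array $dictionary, int $maxLen = 4): array
{
    // Split into Unicode characters without requiring mbstring.
    $chars  = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $lookup = array_flip($dictionary); // word => index, for fast isset() checks
    $words  = [];
    $n      = count($chars);
    $i      = 0;

    while ($i < $n) {
        $matched = 1; // fall back to a single character when nothing matches
        // Try the longest candidate first, then shorter ones.
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $i, $len));
            if (isset($lookup[$candidate])) {
                $matched = $len;
                break;
            }
        }
        $words[] = implode('', array_slice($chars, $i, $matched));
        $i += $matched;
    }

    return $words;
}

// Toy dictionary, made up for this example.
$dictionary = ['最適', 'ツール'];
print_r(segment_with_dictionary('最適なツール', $dictionary));
```

With this toy dictionary the string is split into 最適, な and ツール. Characters that match no dictionary entry become one-character tokens, which is a simplification.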
There's the Kuromoji morphological analyzer for Japanese that can be used for word counting. Unfortunately it's written in Java, not PHP. Since porting it all to PHP is quite a huge task, I'd suggest writing a small wrapper around it so you can call it on the command line, or look into other PHP-Java bridges.
I don't know how applicable it is to languages other than Japanese. You may want to look into the Apache Tika project for similar libraries.
I've had good results using the Intl extension's break iterator, which tokenizes strings using locale-aware word boundaries, e.g.:
<?php
$words = IntlBreakIterator::createWordInstance('zh');
$words->setText('最適なツール');
$count = 0;
foreach ($words as $offset) {
    // Skip boundaries that end a non-word segment (spaces, punctuation).
    if (IntlBreakIterator::WORD_NONE !== $words->getRuleStatus()) {
        $count++;
    }
}
printf("%u words", $count); // 3 words
As I don't understand Chinese I can't verify that "3" is the correct answer. However, it produces accurate results for scripts I do understand, and I am trusting in the ICU library to be solid.
I also note that the passing of the "zh" parameter seems to make no difference to the result, but the argument is mandatory.
I'm running Intl PECL-3.0.0 with ICU 55.1. I discovered that my CentOS servers were running older versions, and those didn't work for Chinese, so make sure you have up-to-date versions.
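One way to check which ICU version PHP is linked against is the INTL_ICU_VERSION constant defined by the intl extension. A minimal sketch follows. The helper name icu_is_at_least is hypothetical, and the 55.1 threshold is simply the version mentioned above, not a documented minimum.

```php
<?php
// Hypothetical helper: returns true/false when intl is loaded, null when it is not.
function icu_is_at_least(string $minimum): ?bool
{
    if (!extension_loaded('intl') || !defined('INTL_ICU_VERSION')) {
        return null; // intl is not available, so there is nothing to compare
    }
    return version_compare(INTL_ICU_VERSION, $minimum, '>=');
}

var_dump(icu_is_at_least('55.1'));
```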