I found the code below on Stack Overflow, and it works well for finding the most common words in a string. But can I exclude common words like "a", "if", "you", "have", etc. from the count? Or would I have to remove those elements after counting? How would I do this? Thanks in advance.
<?php
$text = "A very nice to tot to text. Something nice to think about if you're into text.";
$words = str_word_count($text, 1);
$frequency = array_count_values($words);
arsort($frequency);
echo '<pre>';
print_r($frequency);
echo '</pre>';
?>
This function extracts the most common words from a string. It takes three parameters: the string, an array of stop words, and the number of keywords to return. Load the stop words from a text file with PHP's file() function, which reads a file into an array:
$stop_words = file('stop_words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$this->extract_common_words($text, $stop_words);
You can use this file stop_words.txt as your primary stop words file, or create your own file.
function extract_common_words($string, $stop_words, $max_count = 5) {
    $string = preg_replace('/\s+/', ' ', $string); // collapse each run of whitespace into a single space
    $string = trim($string); // trim the string
    $string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too…
    $string = strtolower($string); // make it lowercase
    preg_match_all('/\b.*?\b/i', $string, $match_words);
    $match_words = $match_words[0];
    foreach ($match_words as $key => $item) {
        // drop empty matches, stop words, and words of three characters or fewer
        if ($item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3) {
            unset($match_words[$key]);
        }
    }
    $word_count = str_word_count(implode(" ", $match_words), 1);
    $frequency = array_count_values($word_count);
    arsort($frequency); // most frequent first
    $keywords = array_slice($frequency, 0, $max_count);
    return $keywords;
}
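The file() call above returns false when the file cannot be read, and entries in a hand-edited list may carry stray whitespace or capitals. A minimal sketch of a more defensive loader follows. The helper name load_stop_words is hypothetical, and treating a missing file as "no stop words" is an assumption, not part of the answer:

```php
<?php
// Hypothetical helper (not from the answer): load a stop-word list, one word per line.
function load_stop_words(string $path): array {
    $lines = is_readable($path)
        ? file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : false;
    if ($lines === false) {
        return []; // assumption: an unreadable file simply means no stop words
    }
    // Trim stray whitespace and lowercase each entry, then drop empty entries
    $words = array_filter(array_map(fn($w) => strtolower(trim($w)), $lines), 'strlen');
    // Remove duplicates and reindex from 0
    return array_values(array_unique($words));
}

$stop_words = load_stop_words('stop_words.txt');
```

Lowercasing the list matters here because extract_common_words() lowercases the text before comparing it against the stop words.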
There are no additional parameters, and no native PHP function, that let you pass in words to exclude. As such, I would just use what you have and ignore a custom set of words in the array returned by str_word_count.
You can do this easily by using array_diff():
$words = array("if", "you", "do", "this", 'I', 'do', 'that');
$stopwords = array("a", "you", "if");
print_r(array_diff($words, $stopwords));
gives
Array
(
[2] => do
[3] => this
[4] => I
[5] => do
[6] => that
)
But you have to take care of upper and lower case yourself. The easiest way here would be to convert the text to lowercase beforehand.
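If the original casing should survive in the result, one alternative to lowercasing everything is array_udiff() with strcasecmp as the comparison callback. A minimal sketch, reusing the arrays from the example above with a capitalised "If" added for illustration:

```php
<?php
$words = array("If", "you", "do", "this", "I", "do", "that");
$stopwords = array("a", "you", "if");

// array_udiff() works like array_diff() but compares values with the given callback.
// strcasecmp() returns 0 for strings that differ only in case, so "If" matches "if".
$filtered = array_udiff($words, $stopwords, 'strcasecmp');

print_r($filtered);
```

As with array_diff(), the surviving words keep their original keys, so wrap the result in array_values() if you need a reindexed list.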
Here is my solution using built-in PHP functions:
most_frequent_words — Find the most frequent word(s) in a string
function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string); // Make the string lowercase
    $words = str_word_count($string, 1); // Get an array of all the words found in the string
    $words = array_diff($words, $stop_words); // Remove the stop words from the array
    $words = array_count_values($words); // Count the occurrences of each word
    arsort($words); // Sort by count, highest first
    return array_slice($words, 0, $limit); // Keep only the first $limit words
}
Returns an array of the words that appear most frequently in the string, keyed by word with the count as the value.
Parameters:
string $string - The input string.
array $stop_words (optional) - List of words to filter out of the result. Default: empty array.
int $limit (optional) - Maximum number of words to return. Default: 5.
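As a usage sketch, here is the function applied to the text from the question. The stop-word list is an illustrative assumption, not something from the question, and the function is repeated in condensed form (guarded with function_exists) only so that the snippet runs on its own:

```php
<?php
// Condensed copy of the function above, guarded so the snippet also runs alongside the original.
if (!function_exists('most_frequent_words')) {
    function most_frequent_words($string, $stop_words = [], $limit = 5) {
        $words = str_word_count(strtolower($string), 1);
        $words = array_count_values(array_diff($words, $stop_words));
        arsort($words);
        return array_slice($words, 0, $limit);
    }
}

$text = "A very nice to tot to text. Something nice to think about if you're into text.";
// Illustrative stop-word list (an assumption); keep it lowercase to match the lowercased text
$stop_words = ['a', 'very', 'to', 'about', 'if', "you're", 'into'];

$top = most_frequent_words($text, $stop_words, 2);
print_r($top); // "nice" and "text", each with a count of 2
```

Note that str_word_count() treats apostrophes as part of a word, so "you're" has to appear in the stop list in that exact form.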