The Most Efficient Way To Find Top K Frequent Word

2019-01-02 19:47发布

Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence.
Output: The most frequent K words in the text.

My thinking is like this.

  1. use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

  2. sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.

  3. After sorting, we just take the first K words. This takes O(K) time.

To summarize, the total time is O(n+nlg(n)+K), Since K is surely smaller than N, so it is actually O(nlg(n)).

We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be

2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;

3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

To summarize, this solution cost time O(n+k*lg(n)).

This is just my thought. I haven't find out way to improve step 1).
I Hope some Information Retrieval experts can shed more light on this question.

19条回答
永恒的永恒
2楼-- · 2019-01-02 20:27

Simplest code to get the occurrence of most frequently used word.

 function strOccurence(str){
    var arr = str.split(" ");
    var length = arr.length,temp = {},max; 
    while(length--){
    if(temp[arr[length]] == undefined && arr[length].trim().length > 0)
    {
        temp[arr[length]] = 1;
    }
    else if(arr[length].trim().length > 0)
    {
        temp[arr[length]] = temp[arr[length]] + 1;

    }
}
    console.log(temp);
    var max = [];
    for(i in temp)
    {
        max[temp[i]] = i;
    }
    console.log(max[max.length])
   //if you want second highest
   console.log(max[max.length - 2])
}
查看更多
登录 后发表回答