Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence.
Output: The most frequent K words in the text.
My thinking is like this.
use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.
sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.
After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n+nlg(n)+K), Since K is surely smaller than N, so it is actually O(nlg(n)).
We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be
2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;
3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).
To summarize, this solution cost time O(n+k*lg(n)).
This is just my thought. I haven't find out way to improve step 1).
I Hope some Information Retrieval experts can shed more light on this question.
In these situations, I recommend to use Java built-in features. Since, they are already well tested and stable. In this problem, I find the repetitions of the words by using HashMap data structure. Then, I push the results to an array of objects. I sort the object by Arrays.sort() and print the top k words and their repetitions.
For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.
Here is the code
}
Here is the unit tests
For more details refer this test case
I was struggling with this as well and get inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (
List<Set<String>>
) and the word will be in the set at position X where X is the current count of the word. In generally, here's how it works:Map<String, Integer>
.The drawback of this is the list maybe big - can be optimized by using a
TreeMap<Integer, Set<String>>
- but this will add some overhead. Ultimately we can use a mix of HashMap or our own data structure.The code
I just find out the other solution for this problem. But I am not sure it is right. Solution:
You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character until you have k single-word sets. You would use a sortof 256-way tree with lists of partial/complete words at the leafs. You would need to be very careful to not cause string copies everywhere.
This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].
If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.
**
};