Input: A positive integer K and a big text. The text can be viewed as a word sequence, so we don't have to worry about how to break it down into words.
Output: The K most frequent words in the text.
My thinking is as follows.
1) Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is the word and the value is its frequency. This takes O(n) time.
2) Sort the (word, word-frequency) pairs, with word-frequency as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.
3) After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than n, this is O(n*lg(n)).
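A minimal sketch of this baseline (count, then sort, then take K) in Java; the class and variable names are just illustrative, not from any particular library beyond the JDK:

```java
import java.util.*;

public class TopKWordsBySort {
    // Baseline: count with a hash map, sort all (word, frequency) pairs, take the first K.
    public static List<String> topK(List<String> words, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);                 // step 1: O(n) counting
        }
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>(freq.entrySet());
        // step 2: sort descending by frequency; really O(m*lg(m)), m = number of unique words
        pairs.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        // step 3: take the first K
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, pairs.size()); i++) {
            result.add(pairs.get(i).getKey());
        }
        return result;
    }
}
```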
We can improve this. Actually, we only want the top K words; the other words' frequencies are of no concern to us. So we can use partial heap sorting: instead of fully sorting in steps 2) and 3), we change them to
2') Build a heap of (word, word-frequency) pairs with word-frequency as the key. It takes O(n) time to build the heap.
3') Extract the top K words from the heap. Each extraction is O(lg(n)), so the total time is O(K*lg(n)).
To summarize, this solution costs O(n + K*lg(n)) time.
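A sketch of the heap variant in Java. Note that PriorityQueue#addAll inserts element by element, i.e. O(m*lg(m)); a strict O(m) build would need an explicit bottom-up heapify, but the shape of the algorithm is the same and the K extractions match step 3') exactly:

```java
import java.util.*;

public class TopKWordsByHeap {
    // Heap variant: build a max-heap keyed on frequency, then extract K times.
    public static List<String> topK(List<String> words, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);                 // step 1: O(n)
        }
        // step 2': max-heap on frequency
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        heap.addAll(freq.entrySet());
        // step 3': K extractions, each O(lg(m))
        List<String> result = new ArrayList<>();
        for (int i = 0; i < k && !heap.isEmpty(); i++) {
            result.add(heap.poll().getKey());
        }
        return result;
    }
}
```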
This is just my thought; I haven't found a way to improve step 1).
I hope some information retrieval experts can shed more light on this question.
You have a bug in your description: counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. That is usually much smaller than the total number of words, so you should probably just optimize how the hash table is built.
Use a hash table to record every word's frequency while traversing the whole word sequence; the key is the word and the value is its frequency. This takes O(n) time and is the same as everyone explained above.
While inserting into the hash map, keep a TreeSet (specific to Java, but there are implementations in every language) of size K (say K = 10) holding the top K frequent words. While its size is less than K, keep adding elements. Once the size equals K, compare each inserted element with the minimum element (the first element); if it is greater, remove the minimum and insert the new element.
To restrict the size of the TreeSet, see this link.
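A minimal sketch of that idea in Java. For simplicity it fills the TreeSet after counting is done; keeping it updated while counting additionally requires removing a word's stale entry before re-inserting it. The tie-break on the word is needed so the comparator never treats two distinct words as equal:

```java
import java.util.*;

public class TopKWordsByTreeSet {
    // Keep a TreeSet bounded to K entries, ordered by frequency (ties broken by word).
    public static List<String> topK(List<String> words, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }
        Comparator<Map.Entry<String, Integer>> byFreqThenWord =
                Map.Entry.<String, Integer>comparingByValue()
                         .thenComparing(Map.Entry.comparingByKey());
        TreeSet<Map.Entry<String, Integer>> top = new TreeSet<>(byFreqThenWord);
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            top.add(e);
            if (top.size() > k) {
                top.pollFirst();   // drop the current minimum to keep the set at size K
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : top.descendingSet()) {
            result.add(e.getKey());
        }
        return result;
    }
}
```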
Try to think of a special data structure to approach this kind of problem. In this case, a special kind of tree like a trie stores strings in a specific way and is very efficient. A second way is to build your own solution, such as counting words. I guess this TB of data would be in English, and we have around 600,000 words in general, so it would be possible to store only those words and count which strings are repeated; this solution would also need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
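A minimal sketch of the trie-based counting in Java (restricted to lowercase a-z for brevity; the node layout and names are my own assumptions, not taken from the linked article). The resulting counts can then be fed into any of the top-K selection steps above (sort, heap, or bounded TreeSet):

```java
import java.util.*;

public class WordTrie {
    // Trie node: 26 children for 'a'..'z' plus a counter for words ending here.
    private static final class Node {
        Node[] next = new Node[26];
        int count;
        String word;              // stored at the terminal node, for reporting
    }

    private final Node root = new Node();

    // Insert one word and bump its counter; O(length of word).
    public void add(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (cur.next[i] == null) cur.next[i] = new Node();
            cur = cur.next[i];
        }
        cur.count++;
        cur.word = word;
    }

    // Collect all (word, count) pairs by walking the trie.
    public Map<String, Integer> counts() {
        Map<String, Integer> out = new HashMap<>();
        collect(root, out);
        return out;
    }

    private void collect(Node n, Map<String, Integer> out) {
        if (n == null) return;
        if (n.count > 0) out.put(n.word, n.count);
        for (Node child : n.next) collect(child, out);
    }
}
```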
I believe this problem can be solved by an O(n) algorithm: we can do the sorting on the fly. In other words, the sorting in this case is a sub-problem of the traditional sorting problem, since only one counter gets incremented by one each time we access the hash table. Initially, the list is sorted because all counters are zero. As we keep incrementing counters in the hash table, we also maintain another array of hash keys ordered by frequency, as follows: every time we increment a counter, we check its index in the ranked array and whether its count now exceeds that of its predecessor in the list; if so, we swap the two elements. We thus obtain a solution that is at most O(n), where n is the number of words in the original text.
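A minimal sketch of that idea in Java. To keep the array truly sorted when several words share the same count, it swaps the incremented word with the first element of its equal-count block rather than just its immediate predecessor (a standard refinement of the swap step described above); all class and field names here are illustrative assumptions:

```java
import java.util.*;

public class TopKOnTheFly {
    private final List<String> ranked = new ArrayList<>();           // words, non-increasing by count
    private final Map<String, Integer> pos = new HashMap<>();        // word -> index in ranked
    private final Map<String, Integer> count = new HashMap<>();      // word -> frequency
    private final Map<Integer, Integer> blockStart = new HashMap<>();// count -> first index of that block

    // Process one word in O(1).
    public void add(String w) {
        if (!count.containsKey(w)) {
            // A brand-new word has the smallest possible count, so it belongs at the end.
            count.put(w, 1);
            pos.put(w, ranked.size());
            blockStart.putIfAbsent(1, ranked.size());
            ranked.add(w);
            return;
        }
        int c = count.get(w);
        int i = pos.get(w);
        int j = blockStart.get(c);                 // leftmost index of the count-c block
        // Swap w with the first element of its block so the array stays sorted after the bump.
        String other = ranked.get(j);
        ranked.set(j, w);  ranked.set(i, other);
        pos.put(w, j);     pos.put(other, i);
        count.put(w, c + 1);
        // The count-c block now starts one position later (or disappears).
        if (j + 1 < ranked.size() && count.get(ranked.get(j + 1)) == c) blockStart.put(c, j + 1);
        else blockStart.remove(c);
        blockStart.putIfAbsent(c + 1, j);          // w may open a new count-(c+1) block
    }

    // The top K words are simply the first K entries of the ranked array.
    public List<String> topK(int k) {
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }
}
```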
Your problem is the same as this one: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Use a trie and a min-heap to efficiently solve it.
This is an interesting idea to research, and I found this paper related to top-K: https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
Also there is an implementation of it here.