The Most Efficient Way To Find Top K Frequent Word

2019-01-02 19:47发布

Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence.
Output: The most frequent K words in the text.

My thinking is like this.

  1. use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

  2. sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.

  3. After sorting, we just take the first K words. This takes O(K) time.

To summarize, the total time is O(n+nlg(n)+K), Since K is surely smaller than N, so it is actually O(nlg(n)).

We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be

2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;

3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

To summarize, this solution cost time O(n+k*lg(n)).

This is just my thought. I haven't find out way to improve step 1).
I Hope some Information Retrieval experts can shed more light on this question.

19条回答
高级女魔头
2楼-- · 2019-01-02 20:14

In these situations, I recommend to use Java built-in features. Since, they are already well tested and stable. In this problem, I find the repetitions of the words by using HashMap data structure. Then, I push the results to an array of objects. I sort the object by Arrays.sort() and print the top k words and their repetitions.

import java.io.*;
import java.lang.reflect.Array;
import java.util.*;

public class TopKWordsTextFile {

    static class SortObject implements Comparable<SortObject>{

        private String key;
        private int value;

        public SortObject(String key, int value) {
            super();
            this.key = key;
            this.value = value;
        }

        @Override
        public int compareTo(SortObject o) {
            //descending order
            return o.value - this.value;
        }
    }


    public static void main(String[] args) {
        HashMap<String,Integer> hm = new HashMap<>();
        int k = 1;
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));

            String line;
            while ((line = br.readLine()) != null) {
                // process the line.
                //System.out.println(line);
                String[] tokens = line.split(" ");
                for(int i=0; i<tokens.length; i++){
                    if(hm.containsKey(tokens[i])){
                        //If the key already exists
                        Integer prev = hm.get(tokens[i]);
                        hm.put(tokens[i],prev+1);
                    }else{
                        //If the key doesn't exist
                        hm.put(tokens[i],1);
                    }
                }
            }
            //Close the input
            br.close();
            //Print all words with their repetitions. You can use 3 for printing top 3 words.
            k = hm.size();
            // Get a set of the entries
            Set set = hm.entrySet();
            // Get an iterator
            Iterator i = set.iterator();
            int index = 0;
            // Display elements
            SortObject[] objects = new SortObject[hm.size()];
            while(i.hasNext()) {
                Map.Entry e = (Map.Entry)i.next();
                //System.out.print("Key: "+e.getKey() + ": ");
                //System.out.println(" Value: "+e.getValue());
                String tempS = (String) e.getKey();
                int tempI = (int) e.getValue();
                objects[index] = new SortObject(tempS,tempI);
                index++;
            }
            System.out.println();
            //Sort the array
            Arrays.sort(objects);
            //Print top k
            for(int j=0; j<k; j++){
                System.out.println(objects[j].key+":"+objects[j].value);
            }


        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.

查看更多
怪性笑人.
3楼-- · 2019-01-02 20:16
  1. Utilize memory efficient data structure to store the words
  2. Use MaxHeap, to find the top K frequent words.

Here is the code

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;

public class TopKFrequentItems {

private int maxSize;

private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;

public TopKFrequentItems(int k) {
    this.maxSize = k;
    this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}

private Comparator<TrieEntry> maxHeapComparator() {
    return new Comparator<TrieEntry>() {
        @Override
        public int compare(TrieEntry o1, TrieEntry o2) {
            return o1.frequency - o2.frequency;
        }           
    };
}

public void add(String word) {
    this.trie.insert(word);
}

public List<TopK> getItems() {

    for (TrieEntry trieEntry : this.trie.getAll()) {
        if (this.maxHeap.size() < this.maxSize) {
            this.maxHeap.add(trieEntry);
        } else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
            this.maxHeap.remove();
            this.maxHeap.add(trieEntry);
        }
    }
    List<TopK> result = new ArrayList<TopK>();
    for (TrieEntry entry : this.maxHeap) {
        result.add(new TopK(entry));
    }       
    return result;
}

public static class TopK {
    public String item;
    public int frequency;

    public TopK(String item, int frequency) {
        this.item = item;
        this.frequency = frequency;
    }
    public TopK(TrieEntry entry) {
        this(entry.word, entry.frequency);
    }
    @Override
    public String toString() {
        return String.format("TopK [item=%s, frequency=%s]", item, frequency);
    }
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + frequency;
        result = prime * result + ((item == null) ? 0 : item.hashCode());
        return result;
    }
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        TopK other = (TopK) obj;
        if (frequency != other.frequency)
            return false;
        if (item == null) {
            if (other.item != null)
                return false;
        } else if (!item.equals(other.item))
            return false;
        return true;
    }

}   

}

Here is the unit tests

@Test
public void test() {
    TopKFrequentItems stream = new TopKFrequentItems(2);

    stream.add("hell");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hero");
    stream.add("hero");
    stream.add("hero");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("home");
    stream.add("go");
    stream.add("go");
    assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}

For more details refer this test case

查看更多
泪湿衣
4楼-- · 2019-01-02 20:16

I was struggling with this as well and get inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>) and the word will be in the set at position X where X is the current count of the word. In generally, here's how it works:

  1. for each word, store it as part of map of it's occurrence: Map<String, Integer>.
  2. then, based on the count, remove it from the previous count set, and add it into the new count set.

The drawback of this is the list maybe big - can be optimized by using a TreeMap<Integer, Set<String>> - but this will add some overhead. Ultimately we can use a mix of HashMap or our own data structure.

The code

public class WordFrequencyCounter {
    private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
    Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
    List<Set<String>> reverseCounters = new ArrayList<Set<String>>();

    private static class MutableCounter {
        int i = 1;
    }

    public List<String> countMostFrequentWords(String text, int max) {
        int lastPosition = 0;
        int length = text.length();
        for (int i = 0; i < length; i++) {
            char c = text.charAt(i);
            if (c <= WORD_SEPARATOR_MAX) {
                if (i != lastPosition) {
                    String word = text.substring(lastPosition, i);
                    MutableCounter counter = counters.get(word);
                    if (counter == null) {
                        counter = new MutableCounter();
                        counters.put(word, counter);
                    } else {
                        Set<String> strings = reverseCounters.get(counter.i);
                        strings.remove(word);
                        counter.i ++;
                    }
                    addToReverseLookup(counter.i, word);
                }
                lastPosition = i + 1;
            }
        }

        List<String> ret = new ArrayList<String>();
        int count = 0;
        for (int i = reverseCounters.size() - 1; i >= 0; i--) {
            Set<String> strings = reverseCounters.get(i);
            for (String s : strings) {
                ret.add(s);
                System.out.print(s + ":" + i);
                count++;
                if (count == max) break;
            }
            if (count == max) break;
        }
        return ret;
    }

    private void addToReverseLookup(int count, String word) {
        while (count >= reverseCounters.size()) {
            reverseCounters.add(new HashSet<String>());
        }
        Set<String> strings = reverseCounters.get(count);
        strings.add(word);
    }

}
查看更多
何处买醉
5楼-- · 2019-01-02 20:19

I just find out the other solution for this problem. But I am not sure it is right. Solution:

  1. Use a Hash table to record all words' frequency T(n) = O(n)
  2. Choose first k elements of hash table, and restore them in one buffer (whose space = k). T(n) = O(k)
  3. Each time, firstly we need find the current min element of the buffer, and just compare the min element of the buffer with the (n - k) elements of hash table one by one. If the element of hash table is greater than this min element of buffer, then drop the current buffer's min, and add the element of the hash table. So each time we find the min one in the buffer need T(n) = O(k), and traverse the whole hash table need T(n) = O(n - k). So the whole time complexity for this process is T(n) = O((n-k) * k).
  4. After traverse the whole hash table, the result is in this buffer.
  5. The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since, k is really smaller than n in general. So for this solution, the time complexity is T(n) = O(kn). That is linear time, when k is really small. Is it right? I am really not sure.
查看更多
萌妹纸的霸气范
6楼-- · 2019-01-02 20:20

You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character until you have k single-word sets. You would use a sortof 256-way tree with lists of partial/complete words at the leafs. You would need to be very careful to not cause string copies everywhere.

This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].

If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.

查看更多
深知你不懂我心
7楼-- · 2019-01-02 20:20
**

C++11 Implementation of the above thought

**

class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {

    unordered_map<int,int> map;
    for(int num : nums){
        map[num]++;
    }

    vector<int> res;
    // we use the priority queue, like the max-heap , we will keep (size-k) smallest elements in the queue
    // pair<first, second>: first is frequency,  second is number 
    priority_queue<pair<int,int>> pq; 
    for(auto it = map.begin(); it != map.end(); it++){
        pq.push(make_pair(it->second, it->first));

        // onece the size bigger than size-k, we will pop the value, which is the top k frequent element value 

        if(pq.size() > (int)map.size() - k){
            res.push_back(pq.top().second);
            pq.pop();
        }
    }
    return res;

}

};

查看更多
登录 后发表回答