What is the most efficient way to ge the top N ite

2019-06-14 10:34发布

问题:

Say you have 4 sorted sets with thousands and thousands of keys and scores. Since they are sorted sets, getting the top items can ben done in logaritmic time complexity.

The easy way would be to take the union of the sets, and then get the top items. But doing so is at least linear to the sum of all items in all sets.

The best way I could think of is this:

  1. Take the top N items from every set
  2. Find the item with the lowest rank and the higest score for that rank.
  3. Devide that score by the number of sets. (Any key with a score lower than this can never be in the top N)
  4. Take the union of those keys. (Ignoring scores)
  5. Find the scores for all keys in all sets. (A key might have score 1 in one set and 10000 in another)

That is like, finding all keys that could possibly be in the top list, and do the union with those keys. There are probably more efficient ways to limit the number of items to consider.

[edit] Keys occur in one or more sets, and their summed scores determines the final score. So a key that is in all sets with a low score might have a higher score than a key with a high score that is in only one set.

回答1:

The algorithm you propose seems quite awkward. Just take one of the following:

The simple way

for i = 1 to n
    loop through all sets and look at their smallest element,
    pick the smallest element and remove it from the sets

Complexity: O(n * s) where n is the number of items you want and s is the number of sets.

Of course, if you are not allowed to remove elements from the sets, you can also maintain iterators into each set to get elements from them in sorted order without having to alter the sets.

A more efficient way

Maintain a priority queue over all the smallest elements of each set. Whenever removing the smallest element e from that priority queue, reinsert the next element from the set from which e came.

Complexity: Assume a simple priority queue with O(log n) 'insert' and O(log n) 'remove smallest element' complexity. There are better ones like fibonacci heaps, but this one will do just fine. Then we have:

  • s insertions to fill the priority queue at the start, so O(s log s).
  • n "delete smallest element" + insert a new one, so O(n log s) (since there are always s elements in the queue)

Thus, we achieve O(s log s + n log s) which is way better.

Comparison

As long as s is quite small, there shouldn't really be a big difference between the algorithms and you can also pick the simple one. If you have a lot of sets, then you should definitely go for the second approach.

Lookup Complexity

In my analysis, I omitted the logarithmic lookup factor to find the smallest element for each set and assumed that the smallest element of each set could be retrieved in O(1), like in a sorted list. Varying the lookup cost from O(1) to O(log n) just introduces an additional factor that does not alter the algorithms. In addition, you usueally only pay the O(log n) once at the first lookup. Afterwards, you usually have an iterator to the smallest element. Accessing each further element using the iterator is then only O(1).