Question:

Trying to find an efficient way to obtain the top N items in a very large list, possibly containing duplicates.

I first tried sorting and slicing, which works, but this seems unnecessary: you shouldn't need to sort a very large list if you just want the top 20 members. So I wrote a recursive routine that builds the top-n list. This also works, but it is very much slower than the non-recursive one!

Question: Why is my second routine (elite2) so much slower than elite, and how do I make it faster? My code is below. Thanks.

import scala.collection.SeqView
import scala.math.min
object X {

    def  elite(s: SeqView[Int, List[Int]], k:Int):List[Int] = {
        s.sorted.reverse.force.slice(0,min(k,s.size))
    }

    def elite2(s: SeqView[Int, List[Int]], k:Int, s2:List[Int]=Nil):List[Int] = {
        if( k == 0 || s.size == 0) s2.reverse
        else {
            val m = s.max
            val parts = s.force.partition(_==m)
            val whole = if( parts._1.size > 1) parts._1.tail:::parts._2 else parts._2
            elite2( whole.view, k-1, m::s2 )
        }
    }

    def main(args:Array[String]) = {
        val N = 1000000/3
        val x = List(N to 1 by -1).flatten.map(x=>List(x,x,x)).flatten.view
        println(elite2(x,20))
        println(elite(x,20))
    }
}

Answer 1:

Unless I'm missing something, why not just traverse the list and pick the top 20 as you go? So long as you keep track of the smallest element of the top 20 there should be no overhead except when adding to the top 20, which should be relatively rare for a long list. Here's an implementation:

  def topNs(xs: TraversableOnce[Int], n: Int) = {
    var ss = List[Int]()
    var min = Int.MaxValue
    var len = 0
    xs foreach { e =>
      if (len < n || e > min) {
        ss = (e :: ss).sorted
        min = ss.head
        len += 1
      }
      if (len > n) {
        ss = ss.tail
        min = ss.head
        len -= 1
      }
    }
    ss
  }

(edited because I originally used a SortedSet not realising you wanted to keep duplicates.)

I benchmarked this for a list of 100k random Ints, and it took on average 40 ms. Your elite method takes about 850 ms and your elite2 method takes about 4100 ms. So this is over 20 times quicker than your fastest.
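
For larger n, the same traverse-and-keep-the-best idea can be sketched with a bounded min-heap, which avoids re-sorting the kept list on every insertion. This is a swapped-in variant, not the code benchmarked above: the names HeapTopN and topN are made up for illustration, and it assumes scala.collection.mutable.PriorityQueue from the standard library.

```scala
import scala.collection.mutable

object HeapTopN {
  // Returns the n largest elements of xs (duplicates kept), largest first.
  def topN(xs: Iterator[Int], n: Int): List[Int] =
    if (n <= 0) Nil
    else {
      // PriorityQueue is a max-heap by default; with the reversed ordering,
      // `head` is the smallest element currently kept.
      val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
      xs.foreach { e =>
        if (heap.size < n) heap.enqueue(e)
        else if (e > heap.head) {
          heap.dequeue()   // drop the current smallest
          heap.enqueue(e)  // keep the new, larger element
        }
      }
      // Elements come out smallest first; prepending puts the largest at the head.
      var result = List.empty[Int]
      while (heap.nonEmpty) result = heap.dequeue() :: result
      result
    }
}
```

Each element that makes it into the heap costs O(log n), and elements that are not larger than the current minimum are rejected with a single comparison.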

Answer 2:

The classic algorithm is called QuickSelect. It is like QuickSort, except that at each step you recurse into only one side of the partition, so it ends up being O(n) on average.

Answer 3:

Don't overestimate how big log(M) is for a large list of length M. For a list containing a billion items, log2(M) is only about 30. So sorting and taking is not such an unreasonable method after all. In fact, sorting an array of integers is far faster than sorting a list (and the array also takes less memory), so I would say that your best (brief) bet (which is safe for short or empty lists thanks to takeRight) is:

val arr = s.toArray
java.util.Arrays.sort(arr)
arr.takeRight(N).toList

There are various other approaches one could take, but the implementations are less straightforward. You could use a partial quicksort, but you have the same problems with worst-case scenarios that quicksort does (e.g. if your list is already sorted, a naive algorithm might be O(n^2)!). You could save the top N in a ring buffer (array), but that would require O(log N) binary search every step as well as O(N/4) sliding of elements--only good if N is quite small. More complex methods (like something based upon dual pivot quicksort) are, well, more complex.

So I recommend that you try array sorting and see if that's fast enough.

(Answers differ if you're sorting objects instead of numbers, of course, but if your comparison can always be reduced to a number, you can s.map(x => /* convert element to corresponding number*/).toArray and then take the winning scores and run through the list again, counting off the number that you need to take of each score as you find them; it's a bit of bookkeeping, but doesn't slow things down much except for the map.)
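
The bookkeeping described in that parenthetical could look roughly like the sketch below. The names TopNByScore, topNBy and score are hypothetical, and the result is returned in the original traversal order rather than sorted by score. Note that this version calls score twice per element, once for the array and once on the second pass.

```scala
import scala.collection.mutable

object TopNByScore {
  // Picks n elements of xs with the highest scores (ties resolved in traversal order).
  def topNBy[A](xs: Seq[A], n: Int)(score: A => Int): List[A] = {
    // First pass: reduce every element to its number and sort the numbers.
    val scores = xs.map(score).toArray
    java.util.Arrays.sort(scores)
    // For each winning score, how many elements with that score are still needed.
    val needed = mutable.Map.empty[Int, Int].withDefaultValue(0)
    scores.takeRight(n).foreach(s => needed(s) += 1)
    // Second pass: count off elements whose score is still needed.
    val out = mutable.ListBuffer.empty[A]
    xs.foreach { x =>
      val s = score(x)
      if (needed(s) > 0) {
        needed(s) -= 1
        out += x
      }
    }
    out.toList
  }
}
```

For example, TopNByScore.topNBy(words, 2)(_.length) would pick the two longest strings from a hypothetical words list.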

Answer 4:

Here's pseudocode for the algorithm I'd use:

selectLargest(n: Int, xs: List): List
  if size(xs) <= n
     return xs
  pivot <- selectPivot(xs)
  (lt, gt) <- partition(xs, pivot)
  if size(gt) == n
     return gt
  if size(gt) < n
     return append(gt, selectLargest(n - size(gt), lt))
  if size(gt) > n
     return selectLargest(n, gt)

selectPivot would use some technique to select a "pivot" value for partitioning the list. partition would split the list into two: lt (elements smaller than the pivot) and gt (elements greater than the pivot). Of course, you'd need to throw elements equal to the pivot in one of those groups, or else handle that group separately. It doesn't make a big difference, as long as you remember to handle that case somehow.

Feel free to edit this answer, or post your own answer, with a Scala implementation of this algorithm.
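
One possible Scala rendering of the pseudocode, as a sketch rather than a tuned implementation. It assumes a random pivot for selectPivot (the pseudocode leaves that choice open) and handles elements equal to the pivot as a separate group, which is one of the two options mentioned above. The object name SelectLargest is made up for illustration, and the result comes back in no particular order.

```scala
import scala.util.Random

object SelectLargest {
  // Returns the n largest elements of xs (duplicates kept), unordered.
  def selectLargest(n: Int, xs: List[Int]): List[Int] =
    if (n <= 0) Nil
    else if (xs.size <= n) xs
    else {
      // Assumption: random pivot; indexing a List is O(size), same as the partition.
      val pivot = xs(Random.nextInt(xs.size))
      val (gt, rest) = xs.partition(_ > pivot)
      val (eq, lt) = rest.partition(_ == pivot)
      if (gt.size >= n) selectLargest(n, gt)      // enough larger elements: look only there
      else if (gt.size + eq.size >= n)
        gt ::: eq.take(n - gt.size)               // top up with copies of the pivot value
      else
        gt ::: eq ::: selectLargest(n - gt.size - eq.size, lt)
    }
}
```

Because the pivot value is never included in gt or lt, every recursive call works on a strictly shorter list, so the all-duplicates case terminates.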

Answer 5:

I wanted a version that was polymorphic and that could also be composed over a single iterator. For instance, what if you wanted both the largest and the smallest elements when reading from a file? Here is what I came up with:

    import util.Sorting.quickSort

    class TopNSet[T](n:Int) (implicit ev: Ordering[T], ev2: ClassManifest[T]){
      val ss = new Array[T](n)
      var len = 0

      def tryElement(el:T) = {
        if(len < n-1){
          ss(len) = el
          len += 1
        }
        else if(len == n-1){
          ss(len) = el
          len = n
          quickSort(ss)
        }
        else if(ev.gt(el, ss(0))){
          ss(0) = el
          quickSort(ss)
        }
      }
      def getTop() = {
        ss.slice(0,len)
      }
    }

Comparing it with the accepted answer:

val myInts = Array.fill(100000000)(util.Random.nextInt)
time(topNs(myInts,100))
//Elapsed time 3006.05485 msecs
val myTopSet = new TopNSet[Int](100)
time(myInts.foreach(myTopSet.tryElement(_)))
//Elapsed time 4334.888546 msecs

So, not much slower, and certainly a lot more flexible.