Please identify this algorithm: probabilistic top-k

Posted 2019-03-10 03:58

Question:

I remember hearing about the following algorithm some years back, but can't find any reference to it online. It identifies the top k elements (or heavy hitters) in a data stream of n elements using only m counters. This is particularly useful for finding top search terms, network abusers, etc. while using minimal memory.

The algorithm: for each element,

  1. If the element does not already have a counter and counters < m, create a counter for the element and initialize to 1.
  2. Else if the element does have a counter, increment it.
  3. Else if the element does not have a counter and all m counters are in use, decrement an existing counter c. If c reaches 0, replace its corresponding element with the current element. (c is an index into the list of existing counters, advancing in round-robin fashion for each element that reaches this step.) A sketch of these steps follows.
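
Here is a minimal, self-contained Python sketch of those three steps. The function name, the `order` slot list, and the `rr` round-robin index are my own naming, not taken from any published implementation:

```python
def top_k_sketch(stream, m):
    # counters maps element -> count; `order` fixes the slot layout so the
    # round-robin index `rr` can walk the counters in a stable order.
    counters = {}
    order = []
    rr = 0
    for x in stream:
        if x in counters:                  # step 2: known element
            counters[x] += 1
        elif len(counters) < m:            # step 1: free counter available
            counters[x] = 1
            order.append(x)
        else:                              # step 3: decrement one counter
            victim = order[rr]
            counters[victim] -= 1
            if counters[victim] == 0:      # counter exhausted: take its slot
                del counters[victim]
                counters[x] = 1
                order[rr] = x
            rr = (rr + 1) % m              # advance round-robin index
    return counters

# e.g. top_k_sketch("a b a c a b d a".split(), m=2) -> {'a': 3, 'b': 1}
```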

I have found many other similar algorithms (many of which are listed, though not described, in this Wikipedia article about streaming algorithms), but not this one. I particularly like it because it is as simple to implement as it is to describe.

But I'd like to learn more about its probabilistic characteristics: if I'm only interested in the top 100 items, what effect does using 1,000 counters instead of 100 have?

Answer 1:

You are talking about the well-known Misra-Gries algorithm; the Space-Saving algorithm is a faster variant of it. For details, see section 1.2 of the Dartmouth lecture notes on streaming algorithms.
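
For contrast, here is a rough Python sketch of how Space-Saving handles the full-counter case: instead of decrementing, it evicts the element with the smallest counter and lets the newcomer inherit that count plus one. The real algorithm uses a stream-summary structure to find the minimum in O(1); this sketch just scans for it, and all names are illustrative:

```python
def space_saving(stream, m):
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1                        # monitored: just increment
        elif len(counts) < m:
            counts[x] = 1                         # free counter available
        else:
            victim = min(counts, key=counts.get)  # smallest counter (O(m) here)
            counts[x] = counts.pop(victim) + 1    # newcomer inherits min + 1
    return counts
```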

One thing I want to point out is that this algorithm does not give you the top k elements if you use only k counters; instead, it is guaranteed to keep a counter for every element with frequency greater than n / k, where n is the total length of the data stream.
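
A quick way to see that guarantee empirically, using the classic Misra-Gries rule from the notes (on overflow, decrement every counter). The helper names here are my own:

```python
from collections import Counter

def misra_gries(stream, k):
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < k:
            counts[x] = 1
        else:
            # overflow: decrement every counter, dropping those that hit zero
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts

stream = ['a'] * 500 + [f'x{i}' for i in range(500)]   # n = 1000, 'a' is heavy
k = 10
summary = misra_gries(stream, k)
n = len(stream)
for elem, freq in Counter(stream).items():
    if freq > n / k:                           # every such element survives...
        assert elem in summary
        assert freq - summary[elem] <= n / k   # ...undercounted by at most n / k
```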

A detailed analysis can be found in the lecture notes linked above.



Answer 2:

You may be looking for the "Frequent" algorithm. It uses k - 1 counters to find all elements that exceed 1/k of the total, and was published in 1982 by Misra and Gries. It's a generalization of Boyer and Moore's (or Fischer-Salzberg's) "Majority" algorithm, where k is 2. These and related algorithms are introduced in a helpful article, "The Britney Spears Problem."
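
For reference, the k = 2 case is small enough to show in full. This is a standard rendering of the Boyer-Moore majority vote, with names of my own choosing:

```python
def majority_candidate(stream):
    # One counter and one candidate: pair off each non-matching element
    # against the current candidate; a true majority cannot be cancelled out.
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate   # must be verified with a second pass

assert majority_candidate(['a', 'b', 'b', 'c', 'b', 'a', 'b']) == 'b'
```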

I give a detailed explanation of the algorithm elsewhere on Stack Overflow, which I won't repeat here. The important point is that, after one pass, the counter values don't precisely indicate the frequency of an item; they can undercount by a margin that grows with the length of the stream and shrinks with the number of counters, up to n / k. All of these algorithms (including Metwally's "Space-Saving") require a second pass if you want an exact count rather than an estimate of frequency.
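
To make the two-pass idea concrete, here is a sketch under my own naming: pass one runs the Frequent algorithm with k - 1 counters to collect candidates, and pass two counts only those candidates exactly (the stream must therefore be re-readable, e.g. a list rather than a one-shot iterator):

```python
from collections import Counter

def frequent(stream, k):
    # Pass 1: the "Frequent" algorithm with k - 1 counters; every element
    # occurring more than n / k times survives, possibly undercounted.
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < k - 1:
            counts[x] = 1
        else:
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return set(counts)

def exact_heavy_hitters(stream, k):
    # Pass 2: count only the surviving candidates exactly, then keep
    # those that truly exceed n / k.
    candidates = frequent(stream, k)
    exact = Counter(x for x in stream if x in candidates)
    n = len(stream)
    return {x: c for x, c in exact.items() if c > n / k}
```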



Answer 3:

That looks like the CPU cache replacement algorithm Least Frequently Used (LFU).

The algorithm: for each element,

  1. If the element does not already have a counter and counters < m, create a counter for the element and initialize it to 1.
     • In cache terms: if the cache is not full, add the line.
  2. Else if the element does have a counter, increment it.
     • In cache terms: increment the cache line's counter.
  3. Else if the element does not have a counter and all m counters are in use, decrement an existing counter c. If c reaches 0, replace its corresponding element with the current element. (c is an index into the list of existing counters, advancing in round-robin fashion for each element that reaches this step.) In cache terms:
     • a. advance to the next candidate cache line
     • b. decrement the current candidate's counter
     • c. if it hasn't reached zero, go to a.
     • d. evict the cache line and replace it with the new one (sketched below)
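
As a rough illustration of that mapping, here is a toy Python cache following the steps above, with a round-robin "hand" that keeps decrementing candidates until one reaches zero and is evicted. The class and its methods are invented for this sketch, not a real cache API:

```python
class ClockLFUCache:
    # Toy model: `counts` maps cache line -> frequency counter, `slots`
    # fixes the candidate order, and `hand` is the round-robin pointer.
    def __init__(self, m):
        self.m = m
        self.counts = {}
        self.slots = []
        self.hand = 0

    def access(self, line):
        if line in self.counts:            # hit: bump the line's counter
            self.counts[line] += 1
        elif len(self.slots) < self.m:     # miss, cache not full: add the line
            self.counts[line] = 1
            self.slots.append(line)
        else:                              # miss, cache full: find a victim
            while True:
                victim = self.slots[self.hand]
                self.counts[victim] -= 1               # decrement candidate
                if self.counts[victim] == 0:           # evict and replace
                    del self.counts[victim]
                    self.counts[line] = 1
                    self.slots[self.hand] = line
                    self.hand = (self.hand + 1) % self.m
                    return
                self.hand = (self.hand + 1) % self.m   # else try next candidate
```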