Can someone explain how the Count Sketch Algorithm works? I still can't figure out how hashes are used, for example. I have a hard time understanding this paper.
Count sketch is a probabilistic data structure that lets you answer the following question: reading a stream of elements

a1, a2, a3, ..., an

in which elements may repeat many times, at any point it tells you how many times you have seen a given ai so far.

You can of course get an exact answer at any time just by maintaining a hash map whose keys are the ai and whose values are the counts seen so far. That gives you O(1) add, O(1) lookup, and an exact count. The only problem is that it takes O(n) space, where n is the number of distinct elements, and the size of each element matters a lot: it takes way more space to store some big string as a key than just `this`.

So how is Count sketch going to help? As in all probabilistic data structures, you sacrifice certainty for space. Count sketch lets you choose two parameters: the accuracy of the results ε and the probability of a bad estimate δ.
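For reference, the exact-count baseline described above is just a hash map of counts. A minimal Python sketch (the stream contents here are made up for illustration):

```python
from collections import Counter

# Exact counting: O(1) amortized add, O(1) lookup, exact answers,
# but O(n) space for n distinct elements (keys stored in full).
counts = Counter()
for element in ["a", "b", "a", "c", "a", "b"]:  # example stream (assumed)
    counts[element] += 1

print(counts["a"])  # 3
print(counts["d"])  # 0 (never seen)
```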
To do this you select a family of d pairwise independent hash functions. This complicated phrase means that the functions rarely collide: if the hashes map values into the space [0, m], then for any two distinct inputs each pair of output values occurs with probability 1/m^2, so the probability of a collision is about 1/m. Each of these hash functions maps values into a space [0, w], so you create a d × w matrix of counters.

Now, when you read an element, you calculate each of the d hashes of this element and update the corresponding counter in each row of the sketch. This part is the same for Count sketch and Count-min sketch. Insomniac nicely explained the idea (calculating the expected value) for Count sketch, so I will just add that with Count-min everything is even simpler: to query a value, you calculate its d hashes and return the smallest of the d counters. Surprisingly, this provides a strong accuracy and probability guarantee, which you can find here.
Increasing the range w of the hash functions increases the accuracy of the results, and increasing the number of hashes d decreases the probability of a bad estimate: ε = e/w and δ = 1/e^d. Another interesting property is that the value is never underestimated: the returned count may be bigger than the real one, but it is surely not smaller.
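A minimal Count-min sketch along these lines might look as follows. This is my own illustrative implementation, not code from the paper; the hash family is the standard ((a·x + b) mod p) mod w construction, which is pairwise independent:

```python
import math
import random

class CountMinSketch:
    """Count-min sketch: d rows of w counters, one pairwise
    independent hash function per row. Estimates are never
    smaller than the true count."""

    _P = 2_147_483_647  # Mersenne prime, larger than any hashed key

    def __init__(self, epsilon, delta, seed=0):
        self.w = math.ceil(math.e / epsilon)      # accuracy:   eps  = e / w
        self.d = math.ceil(math.log(1 / delta))   # confidence: delta = 1 / e^d
        rng = random.Random(seed)
        # One (a, b) pair per row defines h(x) = ((a*x + b) mod p) mod w.
        self._rows = [(rng.randrange(1, self._P), rng.randrange(self._P))
                      for _ in range(self.d)]
        self._table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, row, x):
        a, b = self._rows[row]
        x = hash(x) % self._P  # map arbitrary keys to integers first
        return ((a * x + b) % self._P) % self.w

    def add(self, x, count=1):
        for r in range(self.d):
            self._table[r][self._hash(r, x)] += count

    def query(self, x):
        # The minimum over the rows is the least-contaminated estimate.
        return min(self._table[r][self._hash(r, x)] for r in range(self.d))
```

With ε = 0.01 and δ = 0.01 this uses d = 5 rows of w = 272 counters, regardless of how many distinct elements the stream contains.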
This streaming algorithm instantiates the following framework.

1. Find a randomized streaming algorithm whose output (as a random variable) has the desired expectation but usually high variance (i.e., noise).
2. To reduce the variance/noise, run many independent copies in parallel and combine their outputs.

Usually step 1 is more interesting than step 2. This algorithm's step 2 is actually somewhat nonstandard, but I'm going to talk about step 1 only.

Suppose we're processing an input stream over the alphabet {a, b, c}. With three counters, one per symbol, there's no need to hash. Let's suppose however that we have just one counter. There are eight possible sign functions h : {a, b, c} -> {+1, -1}; on reading an element x, the counter is incremented by h(x). Tabulating the final counter value X for each of the eight functions, we can calculate expectations.
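The calculation can be replayed in code. The input stream below is an assumption (any small stream works the same way); averaging h(a)·X over all eight sign functions recovers the exact count of a:

```python
from itertools import product

stream = ["a", "a", "b", "c"]  # assumed example stream: a occurs twice

# All eight sign functions h : {a, b, c} -> {+1, -1}.
functions = [dict(zip("abc", signs)) for signs in product((+1, -1), repeat=3)]

# For each h, X is the single counter's final value.
estimates = []
for h in functions:
    X = sum(h[x] for x in stream)
    estimates.append(h["a"] * X)  # unbiased estimate of count(a)

print(sum(estimates) / len(estimates))  # 2.0: E[h(a) X] = count(a)
```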
What's going on here? For a, say, we can decompose X = Y + Z, where Y is the contribution to the counter from the occurrences of a, and Z is the contribution from the non-a elements. By linearity of expectation,

E[h(a) X] = E[h(a) Y] + E[h(a) Z].

E[h(a) Y] is a sum with one term h(a)^2 = 1 for each occurrence of a, so E[h(a) Y] is the number of occurrences of a. The other term, E[h(a) Z], is zero: even given h(a), each other hash value is equally likely to be plus or minus one and so contributes zero in expectation. So h(a) X is an unbiased estimator of the count of a.

In fact, the hash function doesn't need to be uniformly random, and a good thing too: there would be no way to store it. It suffices for the hash function to be pairwise independent (any two particular hash values are independent). For our simple example, a random choice from a suitable family of just four functions suffices.
I'll leave the new calculations to you.
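To make those calculations concrete, here is one valid pairwise independent family of four sign functions on {a, b, c} (my choice of family, assumed rather than taken verbatim from the answer above), together with a check that the estimator stays unbiased:

```python
from itertools import product

# A pairwise independent family of four sign functions on {a, b, c}:
# for every pair x != y, the value pair (h(x), h(y)) is uniform over
# the four combinations (+1,+1), (+1,-1), (-1,+1), (-1,-1).
family = [
    {"a": +1, "b": +1, "c": +1},
    {"a": +1, "b": -1, "c": -1},
    {"a": -1, "b": +1, "c": -1},
    {"a": -1, "b": -1, "c": +1},
]

# Check pairwise independence: each value pair appears exactly once.
for x, y in [("a", "b"), ("a", "c"), ("b", "c")]:
    pairs = {(h[x], h[y]) for h in family}
    assert pairs == set(product((+1, -1), repeat=2))

# The estimator is still unbiased over this family of four.
stream = ["a", "a", "b", "c"]  # assumed example stream: a occurs twice
estimates = [h["a"] * sum(h[x] for x in stream) for h in family]
print(sum(estimates) / len(estimates))  # 2.0
```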