Bloom filter or cuckoo hashing?

Posted 2020-05-19 00:29

Which do you prefer and why?

They can both be used to accomplish similar tasks, but I'm curious to see what people have used in actual applications and their reasoning for doing so.

5 Answers
chillily · 2020-05-19 01:01

Which do you prefer, wine or cheese?

A Bloom filter is for when you have limited space, a high query cost, and mostly negative queries.
In that case, a Bloom filter with 8 bits per key and 4 hash functions gives you a 2.5% false positive rate; you process queries nearly 40 times faster than before, at the cost of one byte per key.
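
For reference, those numbers line up with the standard textbook approximation for a Bloom filter's false positive rate with m bits, n keys, and k hash functions (nothing specific to this answer):

$$p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad \frac{m}{n} = 8,\ k = 4 \;\Rightarrow\; p \approx \left(1 - e^{-1/2}\right)^{4} \approx 0.024,$$

i.e. roughly the 2.5% quoted above.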

On the other hand, if any of the previous conditions do not hold, a hash table acting as a cache makes sense, though it obviously will take a lot more than one byte per entry :-)

You can even skip the hard edge cases of cuckoo hashing if it's a cache: when an insert would need a long displacement chain, you can simply evict an entry instead of rehashing. That also makes the size-increase problems of cuckoo hash tables (or anything other than linear hashing) moot.
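
To make the "skip the hard cases because it's a cache" point concrete, here is a minimal sketch, not taken from any particular library: the class name, the second-hash mixing, and the kick limit are all made up for illustration. On insert it tries the two candidate slots, displaces at most a few entries, and when the chain gets too long it just drops an entry instead of growing the table, which is exactly what a cache is allowed to do.

    import java.util.Objects;

    /**
     * Hypothetical sketch of a cuckoo-style cache: two candidate slots per key,
     * a bounded displacement chain, and plain eviction when the chain gets too
     * long. Because it is only a cache, dropping an entry on insert is
     * acceptable, so no resizing or rehashing is ever needed.
     */
    final class CuckooCache<K, V> {
        private static final int MAX_KICKS = 8;   // arbitrary displacement bound

        private final Object[] keys;
        private final Object[] values;
        private final int mask;                   // capacity must be a power of two

        CuckooCache(int powerOfTwoCapacity) {
            keys = new Object[powerOfTwoCapacity];
            values = new Object[powerOfTwoCapacity];
            mask = powerOfTwoCapacity - 1;
        }

        private int slotA(Object key) {
            return key.hashCode() & mask;
        }

        private int slotB(Object key) {
            // Remix the hash code so the second slot usually differs from the first.
            int h = key.hashCode() * 0x9E3779B1;
            return (h ^ (h >>> 16)) & mask;
        }

        @SuppressWarnings("unchecked")
        V get(K key) {
            int a = slotA(key), b = slotB(key);
            if (Objects.equals(keys[a], key)) return (V) values[a];
            if (Objects.equals(keys[b], key)) return (V) values[b];
            return null;                          // miss: caller falls back to the slow path
        }

        void put(K key, V value) {
            int a = slotA(key), b = slotB(key);
            // Update in place, or use an empty candidate slot if one exists.
            if (keys[a] == null || Objects.equals(keys[a], key)) { keys[a] = key; values[a] = value; return; }
            if (keys[b] == null || Objects.equals(keys[b], key)) { keys[b] = key; values[b] = value; return; }

            // Both candidate slots are full: displace entries along a bounded chain.
            Object k = key, v = value;
            int idx = a;
            for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
                Object displacedKey = keys[idx], displacedValue = values[idx];
                keys[idx] = k;
                values[idx] = v;
                k = displacedKey;
                v = displacedValue;
                // Send the displaced entry to its alternate slot; stop if that slot is free.
                idx = (idx == slotA(k)) ? slotB(k) : slotA(k);
                if (keys[idx] == null) { keys[idx] = k; values[idx] = v; return; }
            }
            // Chain too long: a full cuckoo hash table would have to grow or rehash here,
            // but a cache can simply drop the last displaced entry.
        }
    }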

\"骚年 ilove
3楼-- · 2020-05-19 01:04

If I can tolerate the false positives and space is critical, I use a Bloom filter because it takes less space. Otherwise, I use a hash table.

对你真心纯属浪费 · 2020-05-19 01:07

Cuckoo Filter.

"Cuckoo Filter: Practically Better than Bloom." Bin Fan, David Andersen, Michael Kaminsky, Michael Mitzenmacher CoNext 2014. http://dx.doi.org/10.1145/2674005.2674994

From the blog of one of the authors:

Let me describe a cuckoo filter and some of what's in the paper for you. If you want to avoid a technical discussion, all you need to know is that for reasonably large sized sets, for the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/to construct), and amazingly also allow deletions of keys (which Bloom filters cannot do). If you want to look at code, there's even a github repository for you with code for cuckoo filters.

祖国的老花朵 · 2020-05-19 01:10

I prefer cuckoo hashing. I am wary of the false positives that may show up with Bloom filters at higher fill factors.
I have used cuckoo hashing in an application where we had very large hash tables and were running into memory pressure issues. Please see my eCollections library at http://codeplex.com/ecollections for the implementation of a variant of cuckoo hashing.

Kind regards,

冷血范 · 2020-05-19 01:12

Bloom filters and Cuckoo filters are used in similar situations, but there are a lot of differences underneath that usually determine which is the better choice.

Bloom filters are used internally in database engines, notably Apache Cassandra. The reason is, as other posters have said, to reduce the cost of slow set-membership operations. Basically, any "does this maybe exist or definitely not exist" operation with a high cost can use a Bloom filter to reduce the number of expensive checks actually performed.

Another common example, given today's SaaS model, is a remote REST service with a cost per call. Any API call with a binary answer such as "is this address INVALID?" can use a Bloom filter to eliminate over 90% of duplicate queries. Note that since Bloom and Cuckoo filters have false positives, they are NOT useful for the inverse operation "is this address VALID?".

It is important to remember that Bloom and Cuckoo filters have NO false negatives. This makes them useful for checks like "is this definitely not spam, or maybe spam?" but not for operations where false positives are unacceptable, such as checking user permissions. In this respect they can be conceptually considered the opposite of a cache: both Bloom/Cuckoo filters and caches are used primarily to reduce the cost of expensive operations with a boolean answer, except that caches have no false positives while Bloom/Cuckoo filters have no false negatives.
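
To make the "reduce the number of expensive checks" pattern from the paragraphs above concrete, here is a small sketch using Guava's BloomFilter. The class name and the slowLookup function are hypothetical stand-ins for a disk read, a paid REST call, or any other costly lookup.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;
    import java.util.Optional;
    import java.util.function.Function;

    /** Sketch: gate an expensive lookup behind a Bloom filter. */
    public final class FilteredStore {
        // Guava sizes the bit array and picks the hash count from the expected
        // number of insertions (1M) and the desired false positive rate (1%).
        private final BloomFilter<CharSequence> mightExist =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        private final Function<String, Optional<String>> slowLookup;

        public FilteredStore(Function<String, Optional<String>> slowLookup) {
            this.slowLookup = slowLookup;
        }

        /** Must be called for every key actually written to the backing store. */
        public void recordKey(String key) {
            mightExist.put(key);
        }

        public Optional<String> get(String key) {
            // No false negatives: "definitely not there" means the expensive lookup can be skipped.
            if (!mightExist.mightContain(key)) {
                return Optional.empty();
            }
            // "Maybe there": a false positive only costs one wasted slow lookup.
            return slowLookup.apply(key);
        }
    }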

Notable differences between Cuckoo/Bloom include:

  • Combination. Bloom filters can be merged efficiently, both quickly and with little bandwidth, as long as they were created with the same parameters. This is why you see them used frequently in massively distributed systems: exchanging Bloom filters is cheap. Cuckoo filters are not easily composable, which makes them less useful in these circumstances. (See the merge sketch after this list.)

  • False positive rate. Cuckoo filters are more space efficient. Many use cases for both structures are focused on low-level networking, and on weak hardware the ~40% higher space efficiency of Cuckoo filters for the same false positive rate can be important. The reference implementation, in C++, sorts items within each bucket for additional space savings, taking advantage of an item's position within a bucket to store smaller fingerprints. The additional libraries I'll mention later (including mine) don't seem to do this. If anyone ever uses my library I might add it :).

  • Constant false positive rate. Bloom filters have a false positive rate that degrades as they are filled past their designed capacity: you can keep inserting items forever, but eventually the false positive rate approaches 100%. Cuckoo filters, being based on cuckoo hashing, have a set capacity beyond which insertions will actually fail. Repeated insertion of items with non-random (colliding) hashes can also cause a Cuckoo filter's insertions to fail, possibly far before its designed fill level.

  • Speed. This is subjective and depends a lot on the hardware, but Cuckoo filters are generally faster in the average case (in my experience). Most Bloom filter designs run a hash function twice; with a secure hash function especially, this can be a big handicap compared to Cuckoo filters, which only hash inserted items once. The code I've seen uses various hash functions for Bloom and Cuckoo filters: Google's Guava Bloom filter uses Murmur3, while many other implementations use SHA-1 or something else. If hash collisions can be exploited in your use case, make sure the library uses a secure hash. It is important to know that Bloom filters take roughly constant time to insert, while Cuckoo filters have a constant-time AVERAGE case: as a Cuckoo filter gets within a few percent of capacity, insert speeds slow down greatly. Even then, only insert speed slows; all other operations remain constant average time.

  • Flexibility. Bloom filters only support insert and contains. Cuckoo filters additionally support deletion and limited counting: in the reference design, a Cuckoo filter can determine how many times an item was inserted, up to 7 times, whereas a Bloom filter can only answer yes or no. Cuckoo filters also support deleting inserted items, a big plus in a lot of use cases compared to Bloom. With Bloom filters it is pretty standard to recreate the filter from scratch when it is "full" (the estimated false positive rate exceeds a threshold), since you can't delete old items. Note that rebuilding still happens with Cuckoo filters when inserts start to fail, so depending on the use case this might be moot. In certain situations Cuckoo filters are more useful, since you can delete items to stay within the filter's limits instead of rebuilding.

  • Support. Cuckoo filters are new, and stable libraries for many languages simply don't exist yet.
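
On the combination point above: with Guava, merging Bloom filters built with identical parameters is a one-liner (putAll ORs the underlying bit arrays), which is also how the "one filter per thread, combine occasionally" approach mentioned at the end of this answer works. A small sketch (the shard names and data are made up):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public final class BloomMergeDemo {
        // Both filters must be created with the same parameters to be mergeable.
        private static BloomFilter<CharSequence> newFilter() {
            return BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000, 0.01);
        }

        public static void main(String[] args) {
            BloomFilter<CharSequence> shardA = newFilter();
            BloomFilter<CharSequence> shardB = newFilter();
            shardA.put("alice@example.com");
            shardB.put("bob@example.com");

            // Merging is just OR-ing the bit arrays; isCompatible guards against
            // filters built with different sizes, hash counts, or funnels.
            if (shardA.isCompatible(shardB)) {
                shardA.putAll(shardB);
            }

            System.out.println(shardA.mightContain("bob@example.com")); // true
        }
    }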

The biggest advantage of Bloom filters is that they have more mature library support in most languages. The math behind Bloom filters is also better understood: most of the characteristics of Cuckoo filters have been determined empirically, whereas Bloom filters have a solid analytical basis. This can rule out Cuckoo filters for realtime and critical systems that must have verifiable performance guarantees, even though experimental evidence shows that Cuckoo filters perform better in most circumstances.

Shameless plug: I'm the developer of a Cuckoo filter library for Java, CuckooFilter4J. It is missing the bucket semi-sort used in the paper, so its space efficiency is somewhat lower than the reference implementation's. In the project readme I have links to the other implementations I'm aware of. Which structure is better depends on your use case, but mostly on whether a solid Cuckoo filter implementation exists for your language.

You should definitely take a look at the source before you use a Cuckoo/Bloom filter in production. I read through various libs before writing my own... many of them had silent size limits due to 32-bit underlying arrays, or obvious performance problems, and most had zero tests. Google's Guava Bloom implementation had by far the best code quality and tests (and supports 64-bit array limits). The only shortcomings of Guava's Bloom are that it doesn't have an option to use a secure hash function and isn't multi-threaded.

In a production system you might want multi-threading for speed. With Guava's Bloom, the answer is to make a separate filter for each thread and combine them occasionally. Since Cuckoo filters can't be combined, I added concurrent threading to my Cuckoo filter library instead. The other ones I'm aware of aren't thread-safe or aren't concurrent.
