Which is faster, Hash lookup or Binary search?

Posted 2019-01-12 18:14

When given a static set of objects (static in the sense that once loaded it seldom if ever changes) into which repeated concurrent lookups are needed with optimal performance, which is better, a HashMap or an array with a binary search using some custom comparator?

Is the answer a function of object or struct type? Hash and/or Equal function performance? Hash uniqueness? List size? Hashset size/set size?

The size of the set that I'm looking at can be anywhere from 500k to 10m - in case that information is useful.

While I'm looking for a C# answer, I think the true mathematical answer lies not in the language, so I'm not including that tag. However, if there are C# specific things to be aware of, that information is desired.

16 Answers
再贱就再见
#2 · 2019-01-12 19:04

The answers by Bobby, Bill and Corbin are wrong. O(1) is not automatically faster than O(log n) for a fixed/bounded n: with n bounded, log(n) is itself a constant, so everything comes down to the constant factors.

And as for slow hash functions: ever heard of MD5?

The default string hashing algorithm probably touches all characters, and can be easily 100 times slower than the average compare for long string keys. Been there, done that.

You might be able to (partially) use a radix. If you can split the keys up into 256 approximately equal-sized blocks, you're looking at a binary search over 2k to 40k items per block. That is likely to provide much better performance.
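
A minimal sketch of that idea, assuming string keys and using the low byte of the first character as the radix (the type and method names here are hypothetical, just for illustration):

    using System;
    using System.Collections.Generic;

    static class RadixBucketedSearch
    {
        // One sorted array per radix value; the radix here is the low byte
        // of the first character of the key (a hypothetical choice).
        private static string[][] _buckets;

        private static int Radix(string key) => key.Length > 0 ? (byte)key[0] : 0;

        public static void Build(IEnumerable<string> keys)
        {
            var lists = new List<string>[256];
            for (int i = 0; i < 256; i++) lists[i] = new List<string>();

            foreach (var key in keys)
                lists[Radix(key)].Add(key);

            _buckets = new string[256][];
            for (int i = 0; i < 256; i++)
            {
                lists[i].Sort(StringComparer.Ordinal);
                _buckets[i] = lists[i].ToArray();
            }
        }

        public static bool Contains(string key)
        {
            // Binary search only within the (much smaller) bucket for this radix.
            string[] bucket = _buckets[Radix(key)];
            return Array.BinarySearch(bucket, key, StringComparer.Ordinal) >= 0;
        }
    }

With 500k to 10m keys spread over 256 buckets, each binary search touches roughly 2k to 40k items, which is what the figure above refers to.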

[Edit] Too many people voting down what they do not understand.

String compares in a binary search over a sorted set have a very interesting property: they get slower the closer you get to the target. At first they break on the first character; near the end they break only on the last. Assuming a constant time for them is incorrect.
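
A quick way to see this effect is an ordinal compare that counts the character positions it examines; the names and the "key0000000"-style test data below are made up for the demonstration:

    using System;
    using System.Linq;

    class CompareCostDemo
    {
        // Ordinal compare that also counts how many character positions it examined.
        static int CountingCompare(string a, string b, ref long chars)
        {
            int len = Math.Min(a.Length, b.Length);
            for (int i = 0; i < len; i++)
            {
                chars++;
                if (a[i] != b[i]) return a[i].CompareTo(b[i]);
            }
            return a.Length.CompareTo(b.Length);
        }

        static void Main()
        {
            // One million sorted keys with long shared prefixes: "key0000000" .. "key0999999".
            string[] keys = Enumerable.Range(0, 1000000)
                                      .Select(i => "key" + i.ToString("D7"))
                                      .ToArray();
            string target = "key0734567";

            int lo = 0, hi = keys.Length - 1;
            long charsExamined = 0;
            while (lo <= hi)
            {
                int mid = lo + ((hi - lo) >> 1);
                long before = charsExamined;
                int cmp = CountingCompare(keys[mid], target, ref charsExamined);
                Console.WriteLine(keys[mid] + ": " + (charsExamined - before) + " chars examined");
                if (cmp == 0) break;
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
        }
    }

Early probes differ from the target after a handful of characters; the last few probes share almost the whole prefix and have to examine nearly every character before deciding.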

淡お忘
#3 · 2019-01-12 19:04

A Dictionary/Hashtable uses more memory and takes longer to populate than an array, but lookups are faster with the Dictionary than with a binary search over the array.

Here are the numbers for populating and searching 10 million Int64 items, plus sample code you can run yourself.

Dictionary Memory: 462,836
Array Memory: 88,376
Populate Dictionary: 402 ms
Populate Array: 23 ms
Search Dictionary: 176 ms
Search Array: 680 ms

using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace BinaryVsDictionary
{
    internal class Program
    {
        private const long Capacity = 10000000;

        // Note: the initial capacity of Int16.MaxValue (32,767) is far below the 10M items added,
        // so the dictionary has to grow and re-bucket repeatedly while populating.
        private static readonly Dictionary<long, long> Dict = new Dictionary<long, long>(Int16.MaxValue);
        private static readonly long[] Arr = new long[Capacity];

        private static void Main(string[] args)
        {
            Stopwatch stopwatch = new Stopwatch();

            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                Dict.Add(i, i);
            }

            stopwatch.Stop();

            Console.WriteLine("Populate Dictionary: " + stopwatch.ElapsedMilliseconds);

            stopwatch.Reset();

            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                Arr[i] = i;
            }

            stopwatch.Stop();

            Console.WriteLine("Populate Array:      " + stopwatch.ElapsedMilliseconds);

            stopwatch.Reset();

            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                long value = Dict[i];
//                Console.WriteLine(value + " : " + RandomNumbers[i]);
            }

            stopwatch.Stop();

            Console.WriteLine("Search Dictionary:   " + stopwatch.ElapsedMilliseconds);

            stopwatch.Reset();

            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                long value = BinarySearch(Arr, 0, Capacity - 1, i); // hi is the last valid index (the loop uses low <= hi)
//                Console.WriteLine(value + " : " + RandomNumbers[i]);
            }

            stopwatch.Stop();

            Console.WriteLine("Search Array:        " + stopwatch.ElapsedMilliseconds);

            Console.ReadLine();
        }

        // Returns the index of value if found; otherwise the bitwise complement of the
        // insertion point (same convention as Array.BinarySearch).
        private static long BinarySearch(long[] arr, long low, long hi, long value)
        {
            while (low <= hi)
            {
                long median = low + ((hi - low) >> 1);

                if (arr[median] == value)
                {
                    return median;
                }

                if (arr[median] < value)
                {
                    low = median + 1;
                }
                else
                {
                    hi = median - 1;
                }
            }

            return ~low;
        }
    }
}
叛逆
#4 · 2019-01-12 19:04

It depends on how you handle collisions in your hash table (if at all). Since no hash function is perfect, keys can end up in the same bucket; reaching the bucket stays O(1), but the search within it for the "right" value may be costly. So the answer is that, in theory, hashes are faster most of the time; your mileage may vary depending on the data you put in there...

别忘想泡老子
#5 · 2019-01-12 19:07

The only reasonable answer to this question is: It depends. It depends on the size of your data, the shape of your data, your hash implementation, your binary search implementation, and where your data lives (even though it's not mentioned in the question). A couple other answers say as much, so I could just delete this. However, it might be nice to share what I've learned from feedback to my original answer.

  1. I wrote, "Hash algorithms are O(1) while binary search is O(log n)." - As noted in the comments, Big O notation estimates complexity, not speed. This is absolutely true. It's worth noting that we usually use complexity to get a sense of an algorithm's time and space requirements. So, while it's foolish to assume complexity is strictly the same as speed, estimating complexity without time or space in the back of your mind is unusual. My recommendation: avoid Big O notation.
  2. I wrote, "So as n approaches infinity..." - This is about the dumbest thing I could have included in an answer. Infinity has nothing to do with your problem. You mention an upper bound of 10 million. Ignore infinity. As the commenters point out, very large numbers will create all sorts of problems with a hash. (Very large numbers don't make binary search a walk in the park either.) My recommendation: don't mention infinity unless you mean infinity.
  3. Also from the comments: beware default string hashes (Are you hashing strings? You don't mention.), and note that database indexes are often B-trees (food for thought). My recommendation: consider all your options. Consider other data structures and approaches... like an old-fashioned trie (for storing and retrieving strings; see the sketch below), an R-tree (for spatial data), or an MA-FSA (Minimal Acyclic Finite State Automaton - small storage footprint).
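
As a point of reference for item 3, here is a minimal trie sketch for storing and retrieving strings (a toy implementation, not tied to any particular library):

    using System.Collections.Generic;

    // A toy trie: each node maps a character to a child node.
    class Trie
    {
        private sealed class Node
        {
            public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
            public bool IsWord;
        }

        private readonly Node _root = new Node();

        public void Add(string word)
        {
            Node node = _root;
            foreach (char c in word)
            {
                if (!node.Children.TryGetValue(c, out Node next))
                {
                    next = new Node();
                    node.Children[c] = next;
                }
                node = next;
            }
            node.IsWord = true;
        }

        public bool Contains(string word)
        {
            Node node = _root;
            foreach (char c in word)
            {
                if (!node.Children.TryGetValue(c, out node))
                    return false;
            }
            return node.IsWord;
        }
    }

Lookup cost is proportional to the key length rather than to the number of keys, which is part of why tries (and compressed variants like the MA-FSA) are worth considering for large string sets.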

Given the comments, you might assume that people who use hash tables are deranged. Are hash tables reckless and dangerous? Are these people insane?

Turns out they're not. Just as binary trees are good at certain things (in-order data traversal, storage efficiency), hash tables have their moment to shine as well. In particular, they can be very good at reducing the number of reads required to fetch your data. A hash algorithm can generate a location and jump straight to it in memory or on disk while binary search reads data during each comparison to decide what to read next. Each read has the potential for a cache miss which is an order of magnitude (or more) slower than a CPU instruction.

That's not to say hash tables are better than binary search. They're not. It's also not to suggest that all hash and binary search implementations are the same. They're not. If I have a point, it's this: both approaches exist for a reason. It's up to you to decide which is best for your needs.

Original answer:


Hash algorithms are O(1) while binary search is O(log n). So as n approaches infinity, hash performance improves relative to binary search. Your mileage will vary depending on n, your hash implementation, and your binary search implementation.

Interesting discussion on O(1). Paraphrased:

O(1) doesn't mean instantaneous. It means that the performance doesn't change as the size of n grows. You can design a hashing algorithm that's so slow no one would ever use it and it would still be O(1). I'm fairly sure .NET/C# doesn't suffer from cost-prohibitive hashing, however ;)

老娘就宠你
#6 · 2019-01-12 19:10

I strongly suspect that in a problem set of size ~1M, hashing would be faster.

Just for the numbers:

a binary search would require ~ 20 compares (2^20 == 1M)

a hash lookup would require 1 hash calculation on the search key, plus possibly a handful of compares afterwards to resolve collisions

Edit: the numbers:

    string c = "abcde", d = "rwerij";

    // ~1M hash computations on the search key
    for (int i = 0; i < 1000 * 1000; i++) {
        c.GetHashCode();
    }

    // ~1M simulated binary searches of 20 string compares each
    for (int i = 0; i < 1000 * 1000; i++) {
        for (int j = 0; j < 20; j++)
            c.CompareTo(d);
    }

Times (with c = "abcde", d = "rwerij"): hashcode loop: 0.0012 seconds; compare loop: 2.4 seconds.

Disclaimer: actually benchmarking a hash lookup against a binary search would be better than this not-entirely-relevant test. I'm not even sure whether GetHashCode is memoized under the hood.

成全新的幸福
#7 · 2019-01-12 19:11

Ok, I'll try to be short.

C# short answer:

Test the two different approaches.

.NET gives you the tools to change your approach with a line of code. Otherwise, use System.Collections.Generic.Dictionary and be sure to initialize it with a large number as the initial capacity, or you'll spend the rest of your life inserting items because of the work the GC has to do to collect the old bucket arrays.
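
For example (the key/value types here are placeholders; the count matches the upper bound mentioned in the question):

    using System.Collections.Generic;

    // Pre-sizing avoids repeated growth and re-bucketing while loading 10m items.
    var lookup = new Dictionary<string, object>(capacity: 10000000);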

Longer answer:

A hashtable has ALMOST constant lookup times, and getting to an item in a hash table in the real world doesn't just require computing a hash.

To get to an item, your hashtable will do something like this (a minimal sketch appears below):

  • Get the hash of the key
  • Get the bucket number for that hash (usually the map function looks like this: bucket = hash % bucketsCount)
  • Traverse the chain of items that starts at that bucket (basically a list of items that share the same bucket; most hashtables use this method of handling bucket/hash collisions) and compare each key with the key of the item you are trying to add/delete/update/check for

Lookup times depend on how "good" (how sparse the output is) and how fast your hash function is, on the number of buckets you are using, and on how fast the key comparer is; it's not always the best solution.
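
A minimal sketch of those three steps using separate chaining (toy types and names; this is not how System.Collections.Generic.Dictionary is actually laid out internally):

    using System;
    using System.Collections.Generic;

    // A toy chained hash table illustrating: hash -> bucket -> walk the chain.
    class ChainedHashTable<TKey, TValue>
    {
        private readonly LinkedList<KeyValuePair<TKey, TValue>>[] _buckets;
        private readonly IEqualityComparer<TKey> _comparer;

        public ChainedHashTable(int bucketCount, IEqualityComparer<TKey> comparer = null)
        {
            _buckets = new LinkedList<KeyValuePair<TKey, TValue>>[bucketCount];
            _comparer = comparer ?? EqualityComparer<TKey>.Default;
        }

        public void Add(TKey key, TValue value)
        {
            int bucket = (_comparer.GetHashCode(key) & 0x7FFFFFFF) % _buckets.Length;
            if (_buckets[bucket] == null)
                _buckets[bucket] = new LinkedList<KeyValuePair<TKey, TValue>>();
            _buckets[bucket].AddLast(new KeyValuePair<TKey, TValue>(key, value));
        }

        public bool TryGetValue(TKey key, out TValue value)
        {
            // 1. Get the hash of the key.
            int hash = _comparer.GetHashCode(key);

            // 2. Map the hash to a bucket (bucket = hash % bucketsCount).
            int bucket = (hash & 0x7FFFFFFF) % _buckets.Length;

            // 3. Walk the chain of items that landed in this bucket,
            //    comparing keys until we find the one we want.
            var chain = _buckets[bucket];
            if (chain != null)
            {
                foreach (var pair in chain)
                {
                    if (_comparer.Equals(pair.Key, key))
                    {
                        value = pair.Value;
                        return true;
                    }
                }
            }

            value = default(TValue);
            return false;
        }
    }

Everything before the chain walk is constant time; the chain walk is where a poor hash function or too few buckets makes the "ALMOST" in "almost constant" matter.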

A better and deeper explanation: http://en.wikipedia.org/wiki/Hash_table
