Hash table vs Balanced binary tree [closed]

Posted 2019-01-21 03:00

Question:

What factors should I take into account when I need to choose between a hash table or a balanced binary tree in order to implement a set or an associative array?

Answer 1:

I fear this question cannot be answered in general.

The issue is that there are many types of hash tables and balanced binary trees, and their performances vary widely.

So, the naive answer is: it depends on the functionality you need. Use a hash table if you do not need ordering and a balanced binary tree otherwise.

For a more elaborate answer, let's consider some alternatives.

Hash Table (see Wikipedia's entry for some basics)

  • Not all hash tables use a linked list as a bucket. A popular alternative is to use a "better" bucket, for example a binary tree, or another hash table (with another hash function), and so on.
  • Some hash tables do not use buckets at all: see Open Addressing (they come with other issues, obviously)
  • There is something called linear re-hashing (a quality-of-implementation detail), which avoids the "stop-the-world-and-rehash" pitfall. Basically, during the migration phase you insert only into the "new" table and also move one "old" entry into it per operation. Of course, while the migration lasts every look-up has to check both tables, etc. (see the sketch just after this list).
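
To make that idea concrete, here is a toy Python sketch of a chained hash table that migrates incrementally: every operation advances the rehash by one old bucket, inserts go only into the new bucket array, and look-ups check both arrays while the migration lasts. The names (IncrementalHashTable, put, get) and the growth policy are invented for this illustration; a real implementation would also need deletion, tombstone handling, and better tuning.

class IncrementalHashTable:
    """Toy chained hash table with incremental ("linear") re-hashing."""

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.old = None       # previous bucket array, present only mid-migration
        self.migrated = 0     # number of old buckets already drained
        self.size = 0

    def _idx(self, buckets, key):
        return hash(key) % len(buckets)

    def _step(self):
        # Drain one bucket of the old array, if a migration is in progress.
        if self.old is None:
            return
        for key, value in self.old[self.migrated]:
            self.buckets[self._idx(self.buckets, key)].append((key, value))
        self.old[self.migrated] = []
        self.migrated += 1
        if self.migrated == len(self.old):
            self.old, self.migrated = None, 0   # migration complete

    def get(self, key, default=None):
        self._step()
        # Double look-up while migrating: check the new array, then the old one.
        tables = [self.buckets] if self.old is None else [self.buckets, self.old]
        for buckets in tables:
            for k, v in buckets[self._idx(buckets, key)]:
                if k == key:
                    return v
        return default

    def put(self, key, value):
        self._step()
        if self.old is None and self.size >= 3 * len(self.buckets) // 4:
            # Start a migration: keep the old array, allocate a bigger one.
            self.old, self.buckets = self.buckets, [[] for _ in range(2 * len(self.buckets))]
        # Remove any stale entry for the key, then insert into the *new* array only.
        tables = [self.buckets] if self.old is None else [self.buckets, self.old]
        for buckets in tables:
            bucket = buckets[self._idx(buckets, key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    del bucket[i]
                    self.size -= 1
                    break
        self.buckets[self._idx(self.buckets, key)].append((key, value))
        self.size += 1

table = IncrementalHashTable()
for i in range(100):
    table.put(i, i * i)
assert table.get(64) == 4096 and table.get(1000) is None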

Binary Tree

  • Re-balancing is costly; you may consider a Skip-List (also better for multi-threaded access) or a Splay Tree.
  • A good allocator can "pack" nodes together in memory (better caching behavior), even though this does not alleviate the pointer-look-up issue.
  • B-Trees and their variants also offer this kind of "packing"

Let's not forget that O(1) is an asymptotic complexity. For few elements, the constant factor is usually more important (performance-wise). This is especially true if your hash function is slow...

Finally, for sets, you may also wish to consider probabilistic data structures, like Bloom Filters.
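
For illustration, here is a minimal Bloom filter sketch in Python (the BloomFilter class, its sizing parameters, and the string-only keys are assumptions made for this example). It answers "definitely not in the set" or "possibly in the set", trading a small false-positive rate for very little memory, and it never produces a false negative.

import hashlib

class BloomFilter:
    """Minimal Bloom filter over string items."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from one digest (double-hashing scheme).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bloom = BloomFilter()
bloom.add("alice")
print("alice" in bloom)   # True
print("bob" in bloom)     # False with high probability (false positives are possible)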



Answer 2:

Hash tables are generally better if there isn't any need to keep the data in any sort of sequence. Binary trees are better if the data must be kept sorted.



Answer 3:

A point worth making on a modern architecture: a hash table will usually, if its load factor is low, need fewer memory reads than a binary tree. Since memory accesses tend to be rather costly compared to burning CPU cycles, the hash table is often faster.

In the following, the binary tree is assumed to be self-balancing, like a red-black tree, an AVL tree, or a treap.

On the other hand, when you decide to extend the hash table, everything in it has to be rehashed; this can be a costly operation, even though its cost is amortized over many insertions. Binary trees do not have this limitation.

Binary trees are easier to implement in purely functional languages.

Binary trees have a natural sort order and a natural way to walk the tree for all elements.

When the load factor of the hash table is low, you may be wasting a lot of memory; on the other hand, since every node carries two child pointers, binary trees tend to take up more space per element.

Hash tables are nearly O(1) per operation (depending on how you handle the load factor) vs. the binary tree's O(log n).

Trees tend to be the "average performer": there is nothing they do particularly well, but also nothing they do particularly badly.
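
To make the "self-balancing" remark above a bit more concrete, here is a minimal treap insertion sketch in Python (TreapNode, insert and the rotation helpers are names invented for this example): each node carries a random priority, and a rotation after insertion restores the heap property on priorities, which keeps the tree balanced in expectation without any explicit rebalancing pass.

import random

class TreapNode:
    """Node of a treap: a BST on `key` plus a max-heap on a random `priority`."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.priority = random.random()
        self.left = self.right = None

def rotate_right(node):
    left = node.left
    node.left, left.right = left.right, node
    return left

def rotate_left(node):
    right = node.right
    node.right, right.left = right.left, node
    return right

def insert(root, key, value):
    """Return the treap's root after inserting (or updating) key."""
    if root is None:
        return TreapNode(key, value)
    if key == root.key:
        root.value = value
    elif key < root.key:
        root.left = insert(root.left, key, value)
        if root.left.priority > root.priority:    # restore the heap property
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key, value)
        if root.right.priority > root.priority:
            root = rotate_left(root)
    return root

def inorder(root):
    """Yield (key, value) pairs in sorted key order."""
    if root is not None:
        yield from inorder(root.left)
        yield (root.key, root.value)
        yield from inorder(root.right)

root = None
for k in [5, 1, 9, 3, 7]:
    root = insert(root, k, str(k))
print([k for k, _ in inorder(root)])   # [1, 3, 5, 7, 9]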



Answer 4:

A binary search tree requires a total order relationship among the keys. A hash table requires only an equivalence or identity relationship with a consistent hash function.

If a total order relationship is available, then a sorted array has lookup performance comparable to a binary tree's, worst-case insert performance of the same order as a hash table's, and less complexity and memory use than both.
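
A rough sketch of that sorted-array approach in Python, using the standard bisect module (the lookup/insert helpers and the parallel key/value lists are just one way to arrange it): look-ups are O(log n) binary searches, while inserts pay O(n) in the worst case because later elements have to shift.

import bisect

keys = ["apple", "banana", "mango", "pear"]   # kept sorted at all times
values = [1, 2, 3, 4]                         # values[i] belongs to keys[i]

def lookup(key, default=None):
    i = bisect.bisect_left(keys, key)          # O(log n) binary search
    if i < len(keys) and keys[i] == key:
        return values[i]
    return default

def insert(key, value):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        values[i] = value                      # update in place
    else:
        keys.insert(i, key)                    # O(n) worst case: elements shift
        values.insert(i, value)

insert("cherry", 5)
print(lookup("cherry"))   # 5
print(lookup("durian"))   # None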

The worst-case insertion complexity for a hash table can be kept at O(1) (or O(log K), with K the number of elements sharing the same hash) if it is acceptable to let the worst-case lookup complexity grow to O(K), or to O(log K) if the colliding elements can be kept sorted.

Invariants of both trees and hash tables are expensive to restore if the keys change, but still cheaper than the O(n log n) re-sort a sorted array would need.

These are factors to take into account in deciding which implementation to use:

  1. Availability of a total order relationship.
  2. Availability of a good hashing function for the equivalence relationship.
  3. A priori knowledge of the number of elements.
  4. Knowledge about the rate of insertions, deletions, and lookups.
  5. Relative complexity of the comparison and hashing functions.


Answer 5:

Hash tables offer faster lookups:

  • You need a hash function that spreads the keys evenly (otherwise collisions pile up and look-ups degenerate into something close to a linear search through a bucket).
  • Hash tables can use a lot of empty space. You may reserve 256 entries but only need 8 (so far).

Binary trees:

  • Deterministic: look-ups are O(log n).
  • Don't need the spare capacity that hash tables can waste.
  • Must be kept ordered: every insertion has to find its place in the tree and may trigger rebalancing.


Answer 6:

If you only need to access single elements, hash tables are better. If you need a range of elements (say, all keys between two bounds), you really have no option other than an ordered structure such as a binary tree.
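
As a small illustration of such a range query, here is a sketch using a sorted Python list and the standard bisect module (an ordered array standing in for the ordered structure; a balanced tree obtains the same result by walking, in order, the nodes between the two bounds, while a hash table would have to scan every key):

import bisect

keys = [3, 8, 15, 23, 42, 57, 91]   # kept in sorted order

def keys_in_range(lo, hi):
    """Return every key k with lo <= k <= hi, via two binary searches."""
    left = bisect.bisect_left(keys, lo)
    right = bisect.bisect_right(keys, hi)
    return keys[left:right]

print(keys_in_range(10, 60))   # [15, 23, 42, 57]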



Answer 7:

To add to the other great answers above, I'd say:

Use a hash table if the amount of data will not change (e.g. storing constants); but, if the amount of data will change, use a tree. This is because, in a hash table, once the load factor has been reached, the table must resize, and the resize operation can be very slow.



Answer 8:

One point that I don't think has been addressed is that trees are much better for persistent data structures, that is, immutable structures. A standard hash table (i.e. one that uses a single array of linked lists) cannot be updated non-destructively without copying the whole table. One situation in which this is relevant is when two concurrent functions both hold the same hash table and one of them changes it: if the table is mutable, that change will be visible to the other one as well. Another situation would be something like the following:

def bar(table):
    # some intern stuck this line of code in
    table["hello"] = "world"
    return table["the answer"]

def foo(x, y, table):
    z = bar(table)
    if "hello" in table:
        raise Exception("failed catastrophically!")
    return x + y + z

important_result = foo(1, 2, {
    "the answer": 5,
    "this table": "doesn't contain hello", 
    "so it should": "be ok"
})
# catastrophic failure occurs

With a mutable table, we can't guarantee that the table a function call receives will remain that table throughout its execution, because other function calls might modify it.

So, mutability is sometimes not a pleasant thing. Now, a way around this would be to keep the table immutable, and have updates return a new table without modifying the old one. But with a hash table this would often be a costly O(n) operation, since the entire underlying array would need to be copied. On the other hand, with a balanced tree, a new tree can be generated with only O(log n) nodes needing to be created (the rest of the tree being identical).

This means that an efficient tree can be very convenient when immutable maps are desired.
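
Here is a minimal sketch of such a persistent map in Python; for brevity it uses a plain (unbalanced) BST, while a real implementation would use a self-balancing variant so the copied path is guaranteed to be O(log n). The Node, insert and contains names are invented for this example; note how insert returns a new root and never touches the old tree.

class Node:
    """Immutable BST node: updates copy only the path to the change."""
    __slots__ = ("key", "value", "left", "right")

    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def insert(root, key, value):
    """Return the root of a new tree; the old tree is left untouched."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)   # same key: replace the value

def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

old = None
for k in ["the answer", "this table", "so it should"]:
    old = insert(old, k, k)
new = insert(old, "hello", "world")   # shares all untouched nodes with `old`

print(contains(new, "hello"))   # True
print(contains(old, "hello"))   # False: the original map is unchanged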



Answer 9:

If you'll have many slightly different instances of sets, you'll probably want them to share structure. This is easy with trees (if they're immutable or copy-on-write). I'm not sure how well you can do it with hash tables; it's at least less obvious.



Answer 10:

In my experience, hash tables are always faster because trees suffer too much from cache effects.

To see some real data, you can check the benchmark page of my TommyDS library http://tommyds.sourceforge.net/

There you can compare the performance of the most common hash table, tree and trie libraries available.



Answer 11:

One point to note concerns ordered traversal and the minimum and maximum items. Hash tables don't support any kind of ordered traversal, nor access to the minimum or maximum item. If these capabilities are important, the binary tree is a better choice.
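
A small Python illustration of the difference, with a plain dict standing in for the hash table:

d = {"pear": 3, "apple": 1, "mango": 2}

# A hash table has no ordered view of its keys: min()/max() scan every key
# (O(n)), and an ordered traversal means sorting all the keys (O(n log n))
# every time it is needed.
print(min(d), max(d))        # apple pear
for key in sorted(d):
    print(key, d[key])

# A balanced tree keeps its keys ordered at all times: the minimum is the
# leftmost node and the maximum the rightmost (O(log n) each), and an
# in-order walk yields the keys in sorted order in O(n), with no extra sort.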