Choose Trie or HashMap for storing a word frequency list

Posted 2020-06-06 01:51

Question:

I have a txt file containing 1 million English words with their frequencies, in this format:

good 345667
bad 456777
...

I need to store it in Java using either a HashMap or a trie data structure. Later on I only need to look up words from the list; no other operations are required. My understanding is that lookup is slower for a HashMap than for a trie, but a trie will take up more memory, and implementing a trie also takes effort, while a HashMap is ready to use. For production code, do you have any advice or suggestions on which data structure best suits this situation? Thanks in advance.

Also, a HashMap offers "constant time" lookup. Is it really slower than a trie for English words?
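For reference, I would load the file roughly like this (a sketch; `FrequencyLoader` is a placeholder name, and the whitespace-delimited format above is assumed):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class FrequencyLoader {
    public static Map<String, Long> load(String path) throws IOException {
        // pre-sized so ~1 million entries fit without rehashing
        Map<String, Long> freq = new HashMap<>(2_000_000);
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length == 2) {
                    freq.put(parts[0], Long.parseLong(parts[1]));
                }
            }
        }
        return freq;
    }
}
```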

Answer 1:

My understanding is that, the look up is slower for HashMap than Trie, but Trie will take up more memory usage

This is incorrect. Assuming a good hash function, a lookup in a HashMap requires a small constant number of random accesses to main memory, irrespective of the size of the table or the length of its keys. A trie, in contrast, requires an access to main memory for each letter of the key. Therefore, a trie causes more cache misses, and cache misses dominate the overall lookup cost on modern hardware.

A trie can save memory if the keys are long and share many common prefixes.

A trie also supports prefix queries.

In your case, keys are short, and you don't need prefix queries, so you won't benefit from a trie.



Answer 2:

Given a good hash function (which the String class surely has), a HashMap will have faster lookup time than a Trie.

From Wikipedia, you'll read:

Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.

So, a HashMap with many collisions is slower than a trie. However, that only happens when your key has a poor hash function. If you're using String objects as the key, you won't have that problem.

A Trie will save you memory. Exactly how much will depend on the composition of your data. If the data is similar, you'll have greater memory savings. If the data is varied, there'll be less savings. This is because the prefixes are shared for strings with common prefixes.

So if memory is adequate and you have a good hash function, use a HashMap.

Otherwise, use a Trie.
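To make the prefix-sharing point concrete, here is a minimal trie sketch (illustrative only, not production code): inserting "interconnect" and "international" stores the nodes for the shared prefix "inter" only once, so the two words cost 20 nodes rather than 25.

```java
import java.util.HashMap;
import java.util.Map;

public class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isWord;

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            // reuses an existing child node, so common prefixes are stored once
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        Trie node = find(word);
        return node != null && node.isWord;
    }

    // the kind of prefix query a HashMap cannot answer directly
    public boolean startsWith(String prefix) {
        return find(prefix) != null;
    }

    private Trie find(String s) {
        Trie node = this;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node;
    }

    // counts nodes below the root; shared prefix nodes are counted once
    public int nodeCount() {
        int n = 0;
        for (Trie child : children.values()) n += 1 + child.nodeCount();
        return n;
    }
}
```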



Answer 3:

I guess the operative word here is "million". For that many entries hashing begins to suffer performance problems, while a trie maintains its log(N) characteristic, even if the machine begins paging heavily. And a trie is more suited for a disk-based table (with caching).

But implementing an efficient (and reliable) trie is fairly difficult. Not for the faint of heart.



Answer 4:

1 million is not really such a big number these days for the number of entries in an in-memory data structure, at least on a server, desktop or laptop. On a phone or tab/pad it may become painful.

Implementing an efficient trie is anything but trivial and may end up working against what you hope for in performance and memory usage. Just imagine: every node needs a jump table that branches on potentially every character to a child node. What is your potential character set: all of Unicode, European scripts, ASCII, lowercase and uppercase, or only lowercase? The further to the left your answer is, the larger the jump tables become. But even with just lowercase a-z you need a jump table with up to 26 entries in every node. Speed requires reserving 26*4 bytes in every node; space efficiency rather pushes you to store the table sparsely. Higher up in the trie, likely all slots are needed, and a sparse array would waste both space and time. Closer to the leaves, fewer and fewer slots point to child nodes and the rest stay empty, so a full, fast table would waste space.
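For illustration, a node with the full lowercase-only jump table looks like this (a sketch assuming ASCII a-z input only; a real implementation would have to handle the sparse-versus-full trade-off described above):

```java
public class ArrayTrieNode {
    // full 26-entry jump table: branching is a single array index,
    // but 26 references are reserved in every node, mostly null near the leaves
    private final ArrayTrieNode[] next = new ArrayTrieNode[26];
    private boolean isWord;

    public void insert(String word) {
        ArrayTrieNode node = this;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a'; // assumes lowercase a-z only
            if (node.next[c] == null) node.next[c] = new ArrayTrieNode();
            node = node.next[c];
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        ArrayTrieNode node = this;
        for (int i = 0; i < word.length(); i++) {
            node = node.next[word.charAt(i) - 'a'];
            if (node == null) return false;
        }
        return node.isWord;
    }
}
```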

Java's HashMap has quite a history and is probably one of the best tested, commented, criticized and improved hash map implementations available. For your requirement I would clearly start with it, possibly experiment a bit with the loadFactor, and only if you run into serious problems provably caused by the HashMap would I invest time in a trie.
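If you go the HashMap route, pre-sizing the map for the expected entry count avoids rehashing during the load, and loadFactor is the one knob worth experimenting with (the numbers below are illustrative; `TunedMap` is a hypothetical helper):

```java
import java.util.HashMap;
import java.util.Map;

public class TunedMap {
    public static Map<String, Long> create(int expectedEntries, float loadFactor) {
        // choose an initial capacity large enough that the map never
        // resizes while holding expectedEntries at the given loadFactor
        int capacity = (int) Math.ceil(expectedEntries / loadFactor);
        return new HashMap<>(capacity, loadFactor);
    }
}
```

A lower loadFactor means shorter collision chains at the cost of a larger, emptier table; the default of 0.75 is usually a reasonable starting point.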