How to find a word in large word list (vocabulary)

2020-05-27 05:55发布

问题:

Problem

[Here follows a description of what the app should do under which constrains]

I want a data-structure that searches if a string exists in a 250,000 word-list, while using only a fair amount of ram and keeping the time it takes to load this data-structure into ram small (let's say 0-8 seconds). The time it takes to find a word should also be quick (let's say 0 to 0.5 second), but ram usage is more important. It should also be possible to create multiple games (more on what this game is about at the title "use") without needing significant more memory.

It would also be highly valuable to know which words start with a string, but not enough so to sacrifice load-time by many seconds.


Use

It is for an Android offline game. Limited ram is available. The maximum amount of ram an Application can use according to this post is between 16-32mb ram depending on the device. My empty Android Application already uses about 17mb (using Memory Monitor in Android Studio). My android device caps the ram usage off at 26mb, leaving me at about 8mb of free space for my whole Activity.


Options I tried

They all seem doomed in different ways.

  1. Hashmap - Read all words into a hash-map object.

    1.1 initialize speed: slow to read each word into the Hash-map with 23 seconds.

    1.2 ram usage: uses significant amount of ram, although I forgot how much exactly.

    1.3 search speed: Finding if a word existed in the list was quick of course.

    1.4 narrowing down on possible words (optional): slow, needs to go through the whole hash-map and delete them one by one. Also because it's using deletion, multiple games won't be able to be played using the same instance of the hash-map. Too much memory would be taken when adding more games, making narrowing down on the possible words therefor impossible.

  2. Trie - Implement a RadixTree & You can see my implementation here.

    2.1 initialize speed: slow to read each word into the RadixTree with 47 seconds.

    2.2 ram usage: uses significant amount of ram, so much that Android is suspending threads a couple of times.

    2.3 search speed: Finding if a word existed in the list was quick.

    2.4 narrowing down on possible words (optional): Ultra fast since only a reference to a node in the tree is needed to then find all possible words as its children. You can play a lot of games with narrowing down the possible words since an extra game requires only a reference to a node in the tree!

  3. Scanner - Go through the word-file sequentially

    3.1 initialize speed: none.

    3.2 ram usage: none.

    3.3 search speed: about 20 seconds.

    3.4 narrowing down on possible words (optional): can't be done realistically.

simple code:

String word;
String wordToFind = "example";
boolean foundWord = false;

while (wordFile.hasNextLine()) {
    word = wordFile.nextLine();
    if(word.equals(wordToFind)) {
        foundWord = true;
        break;
    }
}

test.close();

Options I thought of:

  1. Long-binary-search-tree: Converting the word-list to a list of longs then reading these in and doing a binary search on them.

    1.1 initialize speed: probably the same as a hash-map or little less with about 20 seconds. However I hope calling Array.sort() does not take too much time, no idea as of yet.

    1.2 ram usage: if you only account 12 letter words or lower with a 26 letter alphabet you need 5 bits (2^5= 32) to encode a string. An array of longs would need then 250,000*8 bits = around 2mb. Which is not too much.

    1.3 search speed: Arrays.binarySearch()

    1.4 narrowing down on possible words (optional): Narrowing down on possible words could be possible but I am not sure how. According to a comment on this post.

  2. Hashmap with storage - Creating a hashfunction that maps a word to an index number of the word-list file. Then accessing the file at this specific location and look from here to find if a word exists. You can make use of the ordering of the alphabet to determine if you can still find the word since the word-list is in natural order.

    2.1 initialize speed: not needed (since I need to put every word at the right index beforehand.)

    2.2 ram usage: none.

    2.3 search speed: fast.

    2.4 narrowing down on possible words (optional): not possible.


Specific questions I have

  1. Are the options I have thought of in the "Options I have thought of" section viable options or are there things I missed yet which would make them not possible to implement?
  2. Are there options I have not thought of which are better/equal in performance?

End remarks

I have been stuck at this for about a week now. So any new ideas are more than welcome. If any of my assumption above are incorrect I would also be pleased to hear about them.

I made this post this way so others could learn from them as well, either by seeing my mistakes or seeing what does work in the answers.

回答1:

This sounds like an ideal use for a bloom filter. If you're willing to allow the risk of something being falsely considered a word, you can condense your wordlist into an amount of memory as small or as large as you're willing to make it.



回答2:

I had this same issue and ended up going with an "on-disk" trie. That is, I encode the data structure into a single file using byte offsets instead of pointers (packing the nodes in reverse order, with the "root" node being the last written).

It is fast to load by simply reading the file into a byte array, with trie traversal using offset values the same way it would pointers.

My 200K word set fits in 1.7 MB (uncompressed) with a 4 byte value in each word terminating node.