Anagrams - Hashing with chaining and probing in C

2020-03-26 06:29发布

问题:

My title got edited, so I wanted to make sure everyone knows this is homework. The problem is just to optimize the program, the hashing is my idea.

--

I'm working on optimizing a C program that groups together words that are anagrams of each other, and then prints them out.

Currently the program is basically a linked list of linked lists. Each link in the outer list is a group of words that are anagrams of each other.

The profile for the program shows that by far, the largest portion of execution time is the function wordLookup. This is because it has to search every node, and with a possible 100k words read in from a file, this can take a very long time. For instance, here is the gprof output for reading in 40k words:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
100.31      1.48     1.48    40000    37.12    37.12  wordLookup
  0.00      1.48     0.00    78235     0.00     0.00  newnode
  0.00      1.48     0.00    40000     0.00     0.00  sort_string
  0.00      1.48     0.00    38235     0.00     0.00  wordInsert
  0.00      1.48     0.00     1996     0.00     0.00  swap_words
  0.00      1.48     0.00     1765     0.00     0.00  wordAppend

My idea for making this faster is to change the data structure to a hash table that chains all anagrams of each other in the same slot.

Based on things my professor has said and things that I've read here, I'm thinking of something like this for my hash function. (Note: the prime numbers are distributed such that the most used letters are low numbers and the least used are high numbers.)

sort(string)

array alpha_primes = 5,71,37,29,2,53,59,19,11,83,79,31,43,13,7,67,97,23,17,3,41,73,47,89,61,101
hash(String) {
  hash = 1
  for (char in String) {
    hash *= alpha_primes[char-'a'];
  }
  return hash % tablesize
}

Is there a hash table size for this problem that would appropriately distribute the values such that each group of anagrams has a distinct index in the table?

If that is not possible, then should I:

  • chain the lists of words together (list of lists)
  • use a probing (linear or quadratic) solution
  • For either of these scenarios, what are the upsides/downsides when compared?

回答1:

There is no way to guarantee that hashes will be unique. The probability of a collision can be calculated via the birthday problem, and your best bet is to minimize it.

The probability for 2 groups to hash to the same value can be approximated as 1-e^((-k(k-1))/2n), where k is the total amount of groups you have (roughly the same as your word count), and n is the search space of your hash (2^(length of your hash)).

My dictionary has roughly 100000 words, making a 32b hash very good (2% of colissions). However, a hash table that big would use 4GB of RAM. Using a smaller table means more colissions. Chaining or probing won't make a huge difference in time.

As reccomended in comments to your question, a trie will end up in a smaller data structure overall.



标签: c hash anagram