My title got edited, so I wanted to make sure everyone knows this is homework. The problem is just to optimize the program, the hashing is my idea.
--
I'm working on optimizing a C program that groups together words that are anagrams of each other, and then prints them out.
Currently the program is basically a linked list of linked lists. Each link in the outer list is a group of words that are anagrams of each other.
The profile for the program shows that by far, the largest portion of execution time is the function wordLookup
. This is because it has to search every node, and with a possible 100k words read in from a file, this can take a very long time. For instance, here is the gprof
output for reading in 40k words:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
100.31 1.48 1.48 40000 37.12 37.12 wordLookup
0.00 1.48 0.00 78235 0.00 0.00 newnode
0.00 1.48 0.00 40000 0.00 0.00 sort_string
0.00 1.48 0.00 38235 0.00 0.00 wordInsert
0.00 1.48 0.00 1996 0.00 0.00 swap_words
0.00 1.48 0.00 1765 0.00 0.00 wordAppend
My idea for making this faster is to change the data structure to a hash table that chains all anagrams of each other in the same slot.
Based on things my professor has said and things that I've read here, I'm thinking of something like this for my hash function. (Note: the prime numbers are distributed such that the most used letters are low numbers and the least used are high numbers.)
sort(string)
array alpha_primes = 5,71,37,29,2,53,59,19,11,83,79,31,43,13,7,67,97,23,17,3,41,73,47,89,61,101
hash(String) {
hash = 1
for (char in String) {
hash *= alpha_primes[char-'a'];
}
return hash % tablesize
}
Is there a hash table size for this problem that would appropriately distribute the values such that each group of anagrams has a distinct index in the table?
If that is not possible, then should I:
- chain the lists of words together (list of lists)
- use a probing (linear or quadratic) solution
- For either of these scenarios, what are the upsides/downsides when compared?