For those of you not familiar with interpolation search, it is method to search for a value in a sorted array that is potentially faster than binary search. You look at the first and last element and (assuming that the contents of the array are uniformly distributed) linearly interpolate to predict the location.
For example: we have an array of length 100 with array[0]=0 and array[99]=99. If we are looking for 80, it is intuitive to try array[80] over array[50], and if the array is close to uniformly distributed, the expected runtime is reduced to log(log(N))
For numbers, the location to check is defined by the equation:
low + ((toFind - sortedArray[low]) * (high - low + 1)) / (sortedArray[high] - sortedArray[low])
.
A common example used to show off the intuitive nature of interpolation search is: imagine trying to find the word 'yellow' in a dictionary. You wouldn't use binary search and go to the half way point. Rather, you would go to the expected location.
Humans can naturally linearly interpolate strings, but I can't figure out how code it up.
How do we linearly interpolate strings?
To find the "distance" between two strings, a simple method would be to look at the first letter that is different between them and assign a numeric value to each, then take the difference.
For example, the distance from "a" to "y" would be 24 and the distance from "y" to "z" would be 1, if each letter were assigned a value equal to its position in the alphabet.
A better performing method would go through a dictionary to weight the various letters by how common they are in actual words.
Another refinement would be to look at two characters - "aa" is farther from "bz" than "az" is from "ba", for example. Going beyond two characters wouldn't buy you much.
The reason this method isn't more popular is that it complicates the binary search algorithm for not a lot of gain. If you were to time it you might even find that standard binary search is faster; what you gain in fewer comparisons you lose in the complexity of determining distances.
Also note that the worst-case performance of this algorithm is worse than a binary search. Consider for example searching for "ae" in the list of "aa","ab","ac","ad","ae","zz" - the outlier "zz" is going to bias the search so that it's always trying the beginning of the search range. It degrades to O(n) under these conditions.