I have a list of cities that have numerous incorrect spelling for the same city. One city is misspelled 18 times! I am trying to clean this up but its taking hours. Is there some algorithm that might "guess" at the valid city name for each of these misspelled ones? Some form of weighting? The data is in MySQL and I do have a table of the correct spelling as well to compare against.
Any ideas on this? A PHP example would help if possible.
Read about Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance.
Find an implementation or write your own. It's not that complex.
Use it to locate near-miss spelling errors.
As the infamous "Bariteney Spears" suggests, and as most us know from experience, Google, through it's crawling, has gotten pretty good at correctly spell checking popular names. City are the things it usually corrects well. You might thus try to write a PHP function that parses a Googlse search page to see what correction Google suggests, or, more complicated (because the page is more complex), you might even try to parse the options given to you by Google Maps.
you could use a damerau-levenstein function to get the string distance between two strings. (This also checks for transposition)
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
If your tables are big then you might need to optimise the algo a bit to break once the string distance exceeds a threashold.
Also if you can assume that the first letter of the city is typed correctly that should reduce the number of comparisons to.
Its not PHP, but if its of any help heres a java version I wrote: