I would like to be able to search a table as follows for smith as get everything that it within 1 variance.
Data:
O'Brien Smithe Dolan Smuth Wong Smoth Gunther Smiht
I have looked into using Levenshtein distance does anyone know how to implement this with it?
If you only want to know if the levenshtein-distance is at most 1, you can use the following MySQL function.
This is basically a single step in the recursive description of the levenshtein distance. The function returns 1, if the distance is at most 1, else it returns 0.
Since this function does not completely compute the levenshtein-distance, it is much faster.
You can also modify this function such that it returns
true
if the levenshtein-distance is at most 2 or 3, by calling it self recursively. If MySQL does not support recursive calls, you can copy slightly modified versions of this function two times and call them instead. But you should not use the recursive function to calculate the exact levenshtein-distance.Based on Chella's answer and Ryan Ginstrom's article, a fuzzy search could be implemented as so:
In order to efficiently search using levenshtein distance, you need an efficient, specialised index, such as a bk-tree. Unfortunately, no database system I know of, including MySQL, implements bk-tree indexes. This is further complicated if you're looking for full-text search, instead of just a single term per row. Off-hand, I can't think of any way that you could do full-text indexing in a manner that allows for searching based on levenshtein distance.
I had a specialized case of k-distance searching and after installing the Damerau-Levenshtein UDF in MySQL found that the query was taking too long. I came up with the following solution:
Create a new table (or append columns to your target table) with columns for each character position in your target field. ie. My VARCHAR(9) ended up as 9 TINYINT columns + 1 Id column that matches my main table (add indexes for each column). I added triggers to ensure that these new columns always get updated when my main table gets updated.
To perform a k-distance query use the following predicate:
(Column1=s[0]) + (Column2=s[1]) + (Column3=s[2]) + (Column4=s[3]) + ... >= m
where s is your search string and m is the required number of matching characters (or m = 9 - d in my case where d is the maximum distance I want returned).
After testing I found that a query over 1 million rows that was taking 4.6 seconds on average was returning matching ids in less than a second. A second query to return the data for the matching rows in my main table similarly took under a second. (Combining these two queries as a subquery or join resulted in significantly longer execution times and I'm not sure why.)
Though this is not Damerau-Levenshtein (doesn't account for substitution) it suffices for my purposes.
Though this solution probably doesn't scale well for a larger (length) search space it worked for this restrictive case very well.
I am setting up a search based on Levenshtein or Damerau-Levenshtein (probably the latter) for multiple searches over an indexed text, based on a paper by by Gonzalo Navarro and Ricardo Baeza-yates: link text
After building a suffix array (see wikipedia), if you are interested in a string with at most k mismatches to the search string, break the search string into k+1 pieces; at least one of those must be intact. Find the substrings by binary search over the suffix array, then apply the distance function to the patch around each matched piece.
There is a mysql UDF implementation of Levenshtein Distance function
https://github.com/jmcejuela/Levenshtein-MySQL-UDF
It is implemented in C and has better performance than the "MySQL Levenshtein distance query" mentioned by schnaader