Implementation of Levenshtein distance for mysql/f

2019-01-01 10:53发布

I would like to be able to search a table as follows for smith as get everything that it within 1 variance.

Data:

O'Brien
Smithe
Dolan
Smuth
Wong
Smoth
Gunther
Smiht

I have looked into using Levenshtein distance does anyone know how to implement this with it?

9条回答
回忆,回不去的记忆
2楼-- · 2019-01-01 11:32

If you only want to know if the levenshtein-distance is at most 1, you can use the following MySQL function.

CREATE FUNCTION `lv_leq_1` (
`s1` VARCHAR( 255 ) ,
`s2` VARCHAR( 255 )
) RETURNS TINYINT( 1 ) DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i INT;
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), i = 1;
    IF s1 = s2 THEN
        RETURN TRUE;
    ELSEIF ABS(s1_len - s2_len) > 1 THEN
        RETURN FALSE;
    ELSE
        WHILE SUBSTRING(s1,s1_len - i,1) = SUBSTRING(s2,s2_len - i,1) DO
            SET i = i + 1;
        END WHILE;
        RETURN SUBSTRING(s1,1,s1_len-i) = SUBSTRING(s2,1,s2_len-i) OR SUBSTRING(s1,1,s1_len-i) = SUBSTRING(s2,1,s2_len-i+1) OR SUBSTRING(s1,1,s1_len-i+1) = SUBSTRING(s2,1,s2_len-i);
    END IF;
END

This is basically a single step in the recursive description of the levenshtein distance. The function returns 1, if the distance is at most 1, else it returns 0.

Since this function does not completely compute the levenshtein-distance, it is much faster.

You can also modify this function such that it returns true if the levenshtein-distance is at most 2 or 3, by calling it self recursively. If MySQL does not support recursive calls, you can copy slightly modified versions of this function two times and call them instead. But you should not use the recursive function to calculate the exact levenshtein-distance.

查看更多
公子世无双
3楼-- · 2019-01-01 11:34

Based on Chella's answer and Ryan Ginstrom's article, a fuzzy search could be implemented as so:

DELIMITER $$
CREATE FUNCTION fuzzy_substring( s1 VARCHAR(255), s2 VARCHAR(255) )
    RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
        RETURN 0;
    ELSEIF s1_len = 0 THEN
        RETURN s2_len;
    ELSEIF s2_len = 0 THEN
        RETURN s1_len;
    ELSE
        WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(0))), j = j + 1;
        END WHILE;
        WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN
                    SET cost = 0; ELSE SET cost = 1;
                END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN
                    SET c = c_temp;
                END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
        END WHILE;
    END IF;
    SET j = 1;
    WHILE j <= s2_len DO
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10);
        IF c > c_temp THEN
            SET c = c_temp;
        END IF;
        SET j = j + 1;
    END WHILE;
    RETURN c;
END$$
DELIMITER ;
查看更多
只靠听说
4楼-- · 2019-01-01 11:36

In order to efficiently search using levenshtein distance, you need an efficient, specialised index, such as a bk-tree. Unfortunately, no database system I know of, including MySQL, implements bk-tree indexes. This is further complicated if you're looking for full-text search, instead of just a single term per row. Off-hand, I can't think of any way that you could do full-text indexing in a manner that allows for searching based on levenshtein distance.

查看更多
荒废的爱情
5楼-- · 2019-01-01 11:37

I had a specialized case of k-distance searching and after installing the Damerau-Levenshtein UDF in MySQL found that the query was taking too long. I came up with the following solution:

  • I have a very restrictive search space (9 character string limited to numeric values).

Create a new table (or append columns to your target table) with columns for each character position in your target field. ie. My VARCHAR(9) ended up as 9 TINYINT columns + 1 Id column that matches my main table (add indexes for each column). I added triggers to ensure that these new columns always get updated when my main table gets updated.

To perform a k-distance query use the following predicate:

(Column1=s[0]) + (Column2=s[1]) + (Column3=s[2]) + (Column4=s[3]) + ... >= m

where s is your search string and m is the required number of matching characters (or m = 9 - d in my case where d is the maximum distance I want returned).

After testing I found that a query over 1 million rows that was taking 4.6 seconds on average was returning matching ids in less than a second. A second query to return the data for the matching rows in my main table similarly took under a second. (Combining these two queries as a subquery or join resulted in significantly longer execution times and I'm not sure why.)

Though this is not Damerau-Levenshtein (doesn't account for substitution) it suffices for my purposes.

Though this solution probably doesn't scale well for a larger (length) search space it worked for this restrictive case very well.

查看更多
孤独寂梦人
6楼-- · 2019-01-01 11:49

I am setting up a search based on Levenshtein or Damerau-Levenshtein (probably the latter) for multiple searches over an indexed text, based on a paper by by Gonzalo Navarro and Ricardo Baeza-yates: link text

After building a suffix array (see wikipedia), if you are interested in a string with at most k mismatches to the search string, break the search string into k+1 pieces; at least one of those must be intact. Find the substrings by binary search over the suffix array, then apply the distance function to the patch around each matched piece.

查看更多
看风景的人
7楼-- · 2019-01-01 11:50

There is a mysql UDF implementation of Levenshtein Distance function

https://github.com/jmcejuela/Levenshtein-MySQL-UDF

It is implemented in C and has better performance than the "MySQL Levenshtein distance query" mentioned by schnaader

查看更多
登录 后发表回答