I am interested in an algorithm in T-SQL for calculating the Levenshtein distance.
You can use the Levenshtein distance algorithm for comparing strings.
Here you can find a T-SQL example at http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx
(Function developed by Joseph Gama.)
Usage:
The algorithm simply returns the step count needed to change one string into the other, replacing a different character at each step.
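For example (assuming the function from the linked article has been created as dbo.LEVENSHTEIN; the exact name and signature in your database may differ):

```sql
-- Turning 'Fuzzy' into 'Fuzy' takes a single deletion, so the distance is 1.
SELECT dbo.LEVENSHTEIN('Fuzzy', 'Fuzy') AS distance;
```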
In TSQL the best and fastest way to compare two items is a SELECT statement that joins tables on indexed columns. Therefore this is how I suggest implementing the editing distance if you want to benefit from the advantages of an RDBMS engine. TSQL loops will work too, but for large-volume comparisons, Levenshtein distance calculations will be faster in other languages than in TSQL.
I have implemented the editing distance in several systems using a series of joins against temporary tables designed for that purpose only. It requires some heavy pre-processing steps - the preparation of the temporary tables - but it works very well with a large number of comparisons.
In a few words: the pre-processing consists of creating, populating and indexing temp tables. The first one contains reference ids, a one-letter column and a charindex column. This table is populated by running a series of insert queries that split every word into letters (using SELECT SUBSTRING) to create as many rows as the words in the source list have letters (I know, that's a lot of rows, but SQL Server can handle billions of rows). Then make a second table with a 2-letter column, another table with a 3-letter column, etc. The end result is a series of tables which contain reference ids and substrings of each of the words, as well as the position of each substring in the word.
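A minimal sketch of that first table, assuming a source table dbo.Words(word_id, word); the table and column names here are illustrative, not from the original answer:

```sql
-- One row per (word, position, letter); pos is the CHARINDEX-style
-- 1-based position mentioned above.
CREATE TABLE #letters (
    word_id int      NOT NULL,
    pos     int      NOT NULL,
    letter  nchar(1) NOT NULL
);

-- A series of insert queries, one per character position,
-- each splitting words with SELECT SUBSTRING as described.
DECLARE @pos int, @maxLen int;
SET @pos = 1;
SELECT @maxLen = MAX(LEN(word)) FROM dbo.Words;
WHILE @pos <= @maxLen
BEGIN
    INSERT INTO #letters (word_id, pos, letter)
    SELECT word_id, @pos, SUBSTRING(word, @pos, 1)
    FROM dbo.Words
    WHERE LEN(word) >= @pos;
    SET @pos = @pos + 1;
END;

CREATE CLUSTERED INDEX ix_letters ON #letters (letter, pos, word_id);

-- The 2-letter table is built the same way with SUBSTRING(word, @pos, 2),
-- the 3-letter table with SUBSTRING(word, @pos, 3), and so on.
```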
Once this is done, the whole game is about duplicating these tables and joining them against their duplicates in a GROUP BY select query which counts the number of matches. This creates a series of measures for every possible pair of words, which are then re-aggregated into a single Levenshtein distance per pair of words.
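As a sketch, with #letters from above and an identically built and indexed duplicate #letters2 (again, illustrative names), the counting step looks roughly like this; the real aggregation into a distance uses several such measures, including ones from the n-gram tables:

```sql
-- Count matching (letter, position) pairs for every pair of words.
SELECT a.word_id AS word_a,
       b.word_id AS word_b,
       COUNT(*)  AS same_letter_same_pos
FROM #letters  AS a
JOIN #letters2 AS b
  ON  b.letter  = a.letter
  AND b.pos     = a.pos
  AND b.word_id > a.word_id     -- each unordered pair only once
GROUP BY a.word_id, b.word_id;
```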
Technically this is very different from most other implementations of the Levenshtein distance (or its variants), so you need to understand deeply how the Levenshtein distance works and why it was designed as it is. Investigate the alternatives as well, because with this method you end up with a series of underlying metrics which can help calculate many variants of the editing distance at the same time, providing you with interesting potential for machine-learning improvements.
Another point already mentioned by previous answers on this page: try to pre-process as much as possible to eliminate the pairs that do not require a distance measurement. For example, a pair of words that have not a single letter in common can be excluded, because their editing distance can be obtained directly from the lengths of the strings. And do not measure the distance between two copies of the same word, since it is 0 by nature. Also remove duplicates before doing the measurement: if your list of words comes from a long text, the same words are likely to appear more than once, so measuring their distance only once will save processing time, etc.
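For instance, a couple of cheap pre-filters in the spirit of the above (same illustrative schema as before):

```sql
-- Deduplicate the word list before any measurement.
SELECT DISTINCT word
INTO #unique_words
FROM dbo.Words;

-- Keep only pairs that share at least one letter; for the others the
-- distance is simply the length of the longer word, so no DP is needed.
SELECT a.word_id AS word_a, b.word_id AS word_b
FROM dbo.Words AS a
JOIN dbo.Words AS b
  ON b.word_id > a.word_id
WHERE EXISTS (SELECT 1
              FROM #letters AS la
              JOIN #letters AS lb
                ON lb.letter = la.letter
              WHERE la.word_id = a.word_id
                AND lb.word_id = b.word_id);
```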
IIRC, with SQL Server 2005 and later you can write stored procedures in any .NET language: see Using CLR Integration in SQL Server 2005. With that it shouldn't be hard to write a procedure for calculating the Levenshtein distance.
A simple Hello, World! extracted from the help: on the .NET side it is just a public class with a static method, marked with the [Microsoft.SqlServer.Server.SqlProcedure] attribute, that calls SqlContext.Pipe.Send("Hello world!"); compile it into a DLL.
Then in your SQL Server run the following (a sketch; the DLL path and the HelloWorldProc.HelloWorld names follow the documentation example and may differ in your build):
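```sql
-- Enable CLR once per server, register the compiled assembly, and wrap
-- its static method in a stored procedure.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO
CREATE ASSEMBLY helloworld FROM 'C:\helloworld.dll' WITH PERMISSION_SET = SAFE;
GO
CREATE PROCEDURE hello
AS EXTERNAL NAME helloworld.HelloWorldProc.HelloWorld;
GO
```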
And now you can test run it:
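```sql
EXEC hello;   -- prints: Hello world!
```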
Hope this helps.
Arnold Fribble had two proposals on sqlteam.com/forums.
This is the younger one, from 2006:
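The following is a reconstruction of that style of implementation for illustration, not necessarily Fribble's verbatim code. It packs each row of the dynamic-programming matrix into a varbinary, two bytes per cell, since T-SQL has no array type (which is also why the inputs are capped at 3999 characters: 4000 two-byte cells fill a varbinary(8000)):

```sql
CREATE FUNCTION dbo.edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
    DECLARE @s1_len int, @s2_len int;
    DECLARE @i int, @j int, @s1_char nchar(1), @c int, @c_temp int;
    DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000);

    SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2);

    -- Degenerate cases: one of the strings is empty.
    IF @s1_len = 0 RETURN @s2_len;
    IF @s2_len = 0 RETURN @s1_len;

    -- Row 0 of the matrix is 0,1,2,...,@s2_len: the cost of building
    -- each prefix of @s2 from the empty string by insertions.
    SELECT @cv1 = 0x0000, @j = 1;
    WHILE @j <= @s2_len
        SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1;

    SET @i = 1;
    WHILE @i <= @s1_len
    BEGIN
        SELECT @s1_char = SUBSTRING(@s1, @i, 1),
               @c = @i,                          -- cell (i, 0) = i deletions
               @cv0 = CAST(@i AS binary(2)),
               @j = 1;
        WHILE @j <= @s2_len
        BEGIN
            SET @c = @c + 1;   -- insertion: left neighbour + 1
            SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j - 1, 2) AS int)
                        + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1)
                               THEN 0 ELSE 1 END;  -- diagonal: match or substitution
            IF @c > @c_temp SET @c = @c_temp;
            SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j + 1, 2) AS int) + 1;  -- above: deletion
            IF @c > @c_temp SET @c = @c_temp;
            SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1;
        END;
        SELECT @cv1 = @cv0, @i = @i + 1;
    END;
    RETURN @c;   -- bottom-right cell of the matrix: the distance
END;
```

For example, SELECT dbo.edit_distance('kitten', 'sitting') should return 3.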
I implemented the standard Levenshtein edit distance function in TSQL with several optimizations that improve the speed over the other versions I'm aware of. In cases where the two strings have characters in common at their start (shared prefix), characters in common at their end (shared suffix), and when the strings are large and a max edit distance is provided, the improvement in speed is significant. For example, when the inputs are two very similar 4000-character strings and a max edit distance of 2 is specified, this is almost three orders of magnitude faster than the edit_distance_within function in the accepted answer, returning the answer in 0.073 seconds (73 milliseconds) vs 55 seconds. It's also memory efficient, using space equal to the larger of the two input strings plus some constant space. It uses a single nvarchar "array" representing a column, and does all computations in-place in it, plus some helper int variables.
The optimizations are described in a little more detail in my blog post on Levenshtein in TSQL, along with a link there to another post with a similar Damerau-Levenshtein implementation. But here is the code (updated 1/20/2014 to speed it up a bit more):
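As a minimal sketch of two of the optimizations named above - shared prefix/suffix trimming and the length-difference cutoff when a max distance is given - and not the author's posted code; it reuses the dbo.edit_distance function sketched earlier. Note that character equality follows the database collation; use a binary collation for case-sensitive trimming:

```sql
CREATE FUNCTION dbo.edit_distance_bounded
    (@s nvarchar(3999), @t nvarchar(3999), @max int)
RETURNS int
AS
BEGIN
    DECLARE @sLen int, @tLen int, @p int, @q int, @d int;
    SELECT @sLen = LEN(@s), @tLen = LEN(@t), @p = 0, @q = 0;

    -- A shared prefix or suffix can never contribute edits, so trim both.
    WHILE @p < @sLen AND @p < @tLen
          AND SUBSTRING(@s, @p + 1, 1) = SUBSTRING(@t, @p + 1, 1)
        SET @p = @p + 1;
    WHILE @q < @sLen - @p AND @q < @tLen - @p
          AND SUBSTRING(@s, @sLen - @q, 1) = SUBSTRING(@t, @tLen - @q, 1)
        SET @q = @q + 1;
    SELECT @s = SUBSTRING(@s, @p + 1, @sLen - @p - @q),
           @t = SUBSTRING(@t, @p + 1, @tLen - @p - @q),
           @sLen = @sLen - @p - @q,
           @tLen = @tLen - @p - @q;

    -- The length difference is a lower bound on the distance, so a pair
    -- that already exceeds @max can be rejected without running the DP.
    IF ABS(@sLen - @tLen) > @max RETURN NULL;   -- NULL = "exceeds @max"

    SET @d = dbo.edit_distance(@s, @t);  -- the function sketched earlier
    RETURN CASE WHEN @d > @max THEN NULL ELSE @d END;
END;
```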
I was looking for a code example for the Levenshtein algorithm, too, and was happy to find it here. Of course I wanted to understand how the algorithm works, so I played around a little bit with one of the examples above, posted by Veve. In order to get a better understanding of the code I created an Excel sheet with the matrix.
[Image: the distance matrix for FUZZY compared with FUZY]
Images say more than 1000 words.
With this Excel sheet I found that there was potential for additional performance optimization. All values in the upper-right red area do not need to be calculated. The value of each red cell is the value of the cell to its left plus 1. This is because the second string is always longer in that area than the first one, which increases the distance by 1 for each additional character.
You can reflect that by using the statement IF @j <= @i and by increasing the value of @i prior to this statement.