q-gram approximate matching optimisations

Question:

I have a table containing 3 million person records on which I want to perform fuzzy matching using q-grams (on surname, for instance). I have created a table of 2-grams linked to this, but search performance is not great at this data volume (a search takes around 5 minutes).

I basically have two questions: (1) Can you suggest any ways to improve performance and avoid a table scan (i.e. having to count the common q-grams between the search string and all 3 million surnames)? (2) With q-grams, if A is similar to B and C is similar to B, does that imply C is similar to A?

Kind regards

Peter

Answer 1:

I've been looking into fuzzy string matching lately, so even at the risk of answering an abandoned question, here goes. Hope you find this useful.

I suppose you're only interested in strings whose edit distance to the search string is smaller than a given threshold, and that your q-grams (or n-grams) look like this:

2-grams for "foobar": {"fo","oo","ob","ba","ar"}
  1. You could use positional q-grams:

    "foobar": {("fo",1),("oo",2),("ob",3),("ba",4),("ar",5)}
    

    The positional information can be used to determine if a matching q-gram is really a "good match".

    For example, if you're searching for "foobar" with a maximum edit distance of 2, you're only interested in words where

    the 2-gram "fo" occurs at a position from 1 to 3, or
    the 2-gram "oo" occurs at a position from 2 to 4, or
    ... and so on.
    

    String "barfoo" doesn't get any matches because the positions of the otherwise matching 2-grams differ by 3.

  2. Also, it might be useful to use the relation between edit distance and the count of matching q-grams. The intuition is that since

    a string s has len(s)-q+1 q-grams

    and

    a single edit operation can affect at most q q-grams,

    we can deduce that

    strings s1 and s2 within edit distance d have at least max(len(s1), len(s2)) - q + 1 - q*d matching non-positional q-grams.

    If you're searching for "foobar" with a maximum edit distance of 2, a matching 7-character string (such as "fotocar") must share at least max(6,7) - 2 + 1 - 2*2 = 2 common 2-grams with the search string.

  3. Finally, the obvious thing to do is to filter by length: the edit distance between two strings is at least the difference of their lengths. For example, if your threshold is 2 and you search for "foobar", the 9-character "foobarbar" obviously cannot match. (A combined SQL sketch of all three filters follows this list.)
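
Putting the three filters together, here is a rough T-SQL sketch. The Surnames and SurnameGrams tables and their columns are illustrative names I've made up for this example, not anything from the question:

DECLARE @d int = 2    -- maximum edit distance
DECLARE @q int = 2    -- gram size
DECLARE @len int = 6  -- LEN('foobar')

-- positional 2-grams of the search string 'foobar'
DECLARE @search TABLE (twog char(2), pos int)
INSERT INTO @search VALUES ('fo',1),('oo',2),('ob',3),('ba',4),('ar',5)

SELECT s.surnameId, s.surname
FROM SurnameGrams g
JOIN @search sq
  ON sq.twog = g.twog
 AND ABS(sq.pos - g.pos) <= @d                 -- 1. positional filter
JOIN Surnames s
  ON s.surnameId = g.surnameId
WHERE ABS(LEN(s.surname) - @len) <= @d         -- 3. length filter
GROUP BY s.surnameId, s.surname
HAVING COUNT(*) >=                             -- 2. count filter
       CASE WHEN LEN(s.surname) > @len
            THEN LEN(s.surname) ELSE @len END - @q + 1 - @q * @d
-- survivors are only candidates; confirm them with an exact edit distance check

The survivors are candidates, not guaranteed matches, so a final exact edit-distance computation over this (hopefully small) set is still needed.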

See http://pages.stern.nyu.edu/~panos/publications/deb-dec2001.pdf for more details and some pseudo-SQL.



Answer 2:

You have certainly seen fuzzy text search everywhere. For example, you type "stck" but you actually mean "stack"! Ever wondered how this stuff works?

There are plenty of algorithms for fuzzy text matching, each with its own pros and cons. The best-known ones are edit distance and q-grams. I want to focus on q-grams today and implement a sample.

Basically, q-grams are the most suitable fuzzy string matching algorithm for relational databases. It is pretty simple: the "q" in q-gram stands for a number, as in 2-gram, 3-gram, or even 4-gram.

2-gram means that every word is broken into a set of two-character grams: "stack" becomes {"st", "ta", "ac", "ck"} and "database" becomes {"da","at","ta","ab","ba","as","se"}.
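
As a minimal sketch, the splitting itself can be done in T-SQL with a tally of starting positions (the @input variable and the tally trick are just for illustration):

DECLARE @input varchar(100) = 'stack'

SELECT SUBSTRING(@input, n, 2) AS twog
FROM (SELECT TOP (LEN(@input) - 1)
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects) AS positions
-- returns 'st', 'ta', 'ac', 'ck'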

Once the words are broken into 2-grams, we can search the database for a set of values instead of a single string. For example, if a user mistypes "stck", a direct search for "stck" will not match "stack" because the "a" is missing, but its 2-gram set {"st","tc","ck"} has 2 grams in common with the 2-gram set of "stack"! Bingo, we've found a pretty close match. It has nothing in common with the 2-gram set of "database" and only 1 gram in common with the 2-gram set of "stat", so we can easily suggest to the user that they meant to type "stack" first, or "stat" second.

Now let's implement it using SQL Server. Assume a hypothetical words dataset. You need a many-to-many relationship between the 2-grams and the words.

CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))

The Grams table should be clustered on twog first, then wordId, for performance. When you query a word (e.g. "stack"), you put its grams in a temp table. First, let's create a few million dummy records.

-- make millions of random 2-grams
DECLARE @i int = 0
WHILE (@i < 5000000)
BEGIN
    -- pick two random lowercase letters (CHAR(97) to CHAR(122))
    DECLARE @rnum1 char = CHAR(CAST(RAND()*26 AS INT) + 97)
    DECLARE @rnum2 char = CHAR(CAST(RAND()*26 AS INT) + 97)
    DECLARE @wordId int = CAST(RAND()*100000 AS int)
    -- skip duplicates so the (twog, wordId) primary key is not violated
    IF NOT EXISTS (SELECT 1 FROM Grams WHERE twog = @rnum1 + @rnum2 AND wordId = @wordId)
        INSERT INTO Grams (twog, wordId) VALUES (@rnum1 + @rnum2, @wordId)
    SET @i = @i + 1
END

Now let's query the word "stack", which is broken into the 2-grams {'st','ta','ac','ck'}.

DECLARE @word TABLE (twog char(2))  -- the 2-grams of 'stack'
INSERT INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')

SELECT wordId, COUNT(*) AS matches
FROM @word w
INNER JOIN Grams g ON w.twog = g.twog
GROUP BY wordId

You should make sure that SQL Server uses a bunch of clustered index seeks (or lookups) to run this query. That should be its natural choice, but sometimes the statistics are corrupted or out of date, and SQL Server may decide that a full scan is cheaper. This usually happens when it does not know the cardinality of the left-hand table; for example, it may assume that the @word table is massive, and that millions of lookups would be more expensive than one full index scan.
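
If the plan shows a scan anyway, refreshing the statistics is the first thing to try; as a last resort, a FORCESEEK table hint (available since SQL Server 2008) forces the seeks, though hints should be used sparingly. A sketch:

UPDATE STATISTICS Grams  -- first try refreshing the statistics

SELECT wordId, COUNT(*) AS matches
FROM @word w
INNER JOIN Grams g WITH (FORCESEEK) ON w.twog = g.twog  -- force index seeks
GROUP BY wordId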



Answer 3:

Interesting paper about indexing DNA q-grams so you don't have to scan the entire table:

www.comp.nus.edu.sg/~atung/publication/qgram_edit.pdf



Answer 4:

I have a simple improvement which will not eliminate the scan, but will speed it up if you use only 2-grams or 3-grams: replace the letters with numbers. Most SQL engines compare numbers a lot faster than they compare strings.

Example: our source table contains text entries in one column. We create a temp table where we split the names into 2-grams. The query below also carries the source row's key (assumed here to be an id column) so each gram can be traced back to its record, and collects the result into a temp table #grams:

SELECT id, SUBSTRING([column], 1, 2) AS gram, 1 AS position
INTO #grams
FROM sourcetable
UNION ALL
SELECT id, SUBSTRING([column], 2, 2) AS gram, 2 AS position
FROM sourcetable
UNION ALL
SELECT id, SUBSTRING([column], 3, 2) AS gram, 3 AS position
FROM sourcetable

and so on, for every starting position from 1 up to the maximum length of a source entry. (A loop-free variant that avoids writing one UNION branch per position is sketched below.)
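
A sketch of that variant, using a tally of starting positions (it assumes entries are at most 100 characters long):

SELECT t.id, SUBSTRING(t.[column], n.pos, 2) AS gram, n.pos AS position
FROM sourcetable t
JOIN (SELECT TOP (100)
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS pos
      FROM sys.all_objects) n
  ON n.pos <= LEN(t.[column]) - 1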

Then we prepare a mapping table which contains all possible 2-letter grams, plus an IDENTITY(1,1) column called gram_id. We may sort the grams by their frequency in the English dictionary and eliminate the most infrequent ones (like 'kk' or 'wq'). The sorting takes some time and research, but it assigns the smallest numbers to the most frequent grams, which improves performance further if we can limit the number of grams to 255, because then we can use a tinyint column for the gram_id.
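
A sketch of that mapping table; all names are illustrative, #grams is the temp table built above, and gram frequency is taken from the data itself as a stand-in for dictionary frequency:

CREATE TABLE GramMap (
    gram_id tinyint IDENTITY(1,1) PRIMARY KEY,
    gram    char(2) NOT NULL UNIQUE
)

-- most frequent grams first, so they get the smallest ids;
-- TOP (255) keeps gram_id inside the tinyint range
INSERT INTO GramMap (gram)
SELECT TOP (255) gram
FROM #grams
GROUP BY gram
ORDER BY COUNT(*) DESC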

Then we build another temp table from the first one, using the gram_id instead of the gram text. That becomes the master table. We create an index on the gram_id and position columns.
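
A sketch with the same illustrative names:

SELECT g.id, m.gram_id, g.position
INTO #master
FROM #grams g
JOIN GramMap m ON m.gram = g.gram  -- grams eliminated from the map are dropped

CREATE INDEX ix_master ON #master (gram_id, position)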

Then, when we have to compare a text string to the master table, we first split the text string into 2-grams, replace each 2-gram with its gram_id (using the mapping table), and compare those against the master table.
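
A sketch of the comparison, assuming @search already holds the search string's gram_ids and positions (translated through GramMap in the same way):

DECLARE @search TABLE (gram_id tinyint, position int)
-- ... fill @search from the 2-grams of the search string via GramMap ...

SELECT m.id, COUNT(*) AS matches
FROM #master m
JOIN @search s
  ON s.gram_id = m.gram_id
 AND ABS(s.position - m.position) <= 2  -- optional: tolerate small position shifts
GROUP BY m.id
ORDER BY matches DESC  -- best candidates first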

That makes a lot of comparisons, but most of them are comparisons between small integers, which is very quick.