How to optimize this Levenshtein distance calculation

Posted 2019-07-31 13:51

Question:

Table a has around 8,000 rows and table b has around 250,000 rows. Without the levenshtein function the query takes just under 2 seconds. With the function included it is taking about 25 minutes.

SELECT
      *
   FROM
      library a,
      classifications b
   WHERE  
      a.`release_year` = b.`year`
      AND a.`id` IS NULL
      AND levenshtein_ratio(a.title, b.title) > 82

Answer 1:

I'm assuming that levenshtein_ratio is a function that you wrote (or included from somewhere else). If so, the database server cannot optimize it in the usual way, by using an index; it simply has to call the function for every pair of rows that survives the other join conditions. With an inner join and tables of this size, that can be an extremely large number (at most 8,000 * 250,000 = 2 billion). You can check how many times the function would be called with this:

SELECT
      count(*)
   FROM
      library a,
      classifications b
   WHERE  
      a.`release_year` = b.`year`
      AND a.`id` IS NULL

That is an explanation of why it is slow (not really an answer to the question of how to optimize it). To optimize it, you likely need to add additional limiting factors to the join condition to reduce the number of calls to the user-defined function.
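One possible limiting factor is a length check, sketched here in Python rather than SQL so that it can be run on its own. It assumes that levenshtein_ratio means 100 * (longer length - edit distance) / longer length; the question does not show the function, so that definition is a guess, and the helper names (levenshtein, may_exceed, similar_pairs) are made up for illustration. Under that assumption the edit distance can never be smaller than the difference in length, so a pair whose lengths are too far apart cannot exceed the threshold and can be skipped without computing the distance at all.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the usual two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """ASSUMED definition: 100 * (longest - distance) / longest."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


def may_exceed(a: str, b: str, threshold: float) -> bool:
    """Cheap length-only test: could the ratio be above the threshold?

    distance >= longest - shortest, so ratio <= 100 * shortest / longest.
    """
    longest = max(len(a), len(b))
    shortest = min(len(a), len(b))
    if longest == 0:
        return True
    return 100 * shortest > threshold * longest


def similar_pairs(left, right, threshold=82):
    """Return matching pairs and how many expensive calls were made."""
    matches, calls = [], 0
    for a in left:
        for b in right:
            if not may_exceed(a, b, threshold):
                continue  # lengths alone rule this pair out
            calls += 1
            if levenshtein_ratio(a, b) > threshold:
                matches.append((a, b))
    return matches, calls


library_titles = ["The Matrix", "Alien"]
classification_titles = ["The Matrx", "Aliens", "The Lord of the Rings", "Up"]
matches, calls = similar_pairs(library_titles, classification_titles)
print(matches, calls)
```

On these sample titles only 2 of the 8 pairs reach the expensive function. In the query, the same bound could be expressed as an extra condition comparing CHAR_LENGTH(a.title) and CHAR_LENGTH(b.title), though whether it helps depends on how the real levenshtein_ratio is defined and on how the server orders the conditions.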



Answer 2:

You are giving too little information to actually help you.

1) My first guess would be to try to create other WHERE conditions that reduce the number of rows to be scanned.

2) If that is not possible: given that the titles in the library and classifications tables are known, one idea would be to create a table where all the ratios are already calculated, like this:

TABLE levenshtein_ratio
id_table_library
id_table_classifications
precalculated_levenshtein_ratio

so you would populate the table using this query:

insert into levenshtein_ratio select a.id, b.id, levenshtein_ratio(a.title, b.title) from library a, classifications b

and then your query would be:

SELECT
      *
   FROM
      library a
      LEFT JOIN classifications b
         ON a.`release_year` = b.`year`
      LEFT JOIN levenshtein_ratio c
         ON c.id_table_library = a.id
         AND c.id_table_classifications = b.id
   WHERE
      a.`id` IS NULL
      AND precalculated_levenshtein_ratio > 82

This query will then probably take no more than the original 2 seconds.

The problem with this solution is the fact that the data in tables a and b can change, so you will need to create a trigger to keep it updated.
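The precalculated-table idea can be tried end to end with a small sketch using Python's sqlite3 module and an in-memory database. Several things here are assumptions rather than part of the answer: the ratio definition (100 * (longer length - edit distance) / longer length), the sample rows, and the table name title_ratio, which is renamed so that the table and the function are easy to tell apart. The sketch also populates only pairs whose years match, instead of every combination of rows, and it leaves out the question's `a.id IS NULL` filter to keep the example small.

```python
import sqlite3


def levenshtein_ratio(a: str, b: str) -> float:
    """Stand-in for the question's function (ASSUMED definition):
    100 * (longer length - edit distance) / longer length."""
    if len(a) < len(b):
        a, b = b, a
    if not a:
        return 100.0
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return 100.0 * (len(a) - previous[-1]) / len(a)


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name used in the question.
conn.create_function("levenshtein_ratio", 2, levenshtein_ratio)

conn.executescript("""
    CREATE TABLE library (id INTEGER PRIMARY KEY, title TEXT, release_year INTEGER);
    CREATE TABLE classifications (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE title_ratio (
        id_table_library INTEGER,
        id_table_classifications INTEGER,
        precalculated_levenshtein_ratio REAL,
        PRIMARY KEY (id_table_library, id_table_classifications)
    );
    INSERT INTO library VALUES (1, 'The Matrix', 1999), (2, 'Alien', 1979);
    INSERT INTO classifications VALUES
        (10, 'The Matrx', 1999), (11, 'The Mummy', 1999),
        (12, 'Aliens', 1979), (13, 'Alien', 1986);
""")

# One-off population: the expensive function runs once per candidate pair.
conn.execute("""
    INSERT INTO title_ratio
    SELECT a.id, b.id, levenshtein_ratio(a.title, b.title)
    FROM library a
    JOIN classifications b ON a.release_year = b.year
""")

# Later queries read the stored ratio instead of calling the function.
rows = conn.execute("""
    SELECT a.title, b.title, c.precalculated_levenshtein_ratio
    FROM library a
    JOIN classifications b ON a.release_year = b.year
    JOIN title_ratio c
        ON c.id_table_library = a.id
        AND c.id_table_classifications = b.id
    WHERE c.precalculated_levenshtein_ratio > 82
    ORDER BY a.id, b.id
""").fetchall()
print(rows)
```

The trade-off is the one noted above: the stored ratios go stale whenever a title changes, so they have to be refreshed by a trigger or a scheduled job.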



Answer 3:

Change your query to use proper joins (the explicit JOIN syntax has been part of the SQL standard since SQL-92).

Also, your levenshtein condition can be moved into the join condition, which may give you a performance benefit:

SELECT *
FROM library a
JOIN classifications b
    ON a.`release_year` = b.`year`
    AND levenshtein_ratio(a.title, b.title) > 82
WHERE a.`id` IS NULL

Also, make sure there's an index on b.year:

create index b_year on classifications(year);