Table a has around 8,000 rows and table b has around 250,000 rows. Without the levenshtein function the query takes just under 2 seconds. With the function included it takes about 25 minutes.
SELECT
*
FROM
library a,
classifications b
WHERE
a.`release_year` = b.`year`
AND a.`id` IS NULL
AND levenshtein_ratio(a.title, b.title) > 82
I'm assuming that levenshtein_ratio is a function that you wrote (or maybe included from somewhere else). If so, the database server cannot optimize it in the usual way, by using an index. Instead, it has to call the function once for every pair of rows that survives the other join conditions. With an inner join and tables of this size, that could be an extremely large number (a maximum of 8000*250000 = 2 billion). You can check the total number of times it would need to be called with this:
SELECT
count(*)
FROM
library a,
classifications b
WHERE
a.`release_year` = b.`year`
AND a.`id` IS NULL
That is an explanation of why it is slow (not really an answer to the question of how to optimize it). To optimize it, you likely need to add additional limiting factors to the join condition to reduce the number of calls to the user-defined function.
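As one possible limiting factor, a length comparison can rule out many pairs before the function is called, because the Levenshtein distance between two strings is never smaller than the difference in their lengths. The sketch below assumes that levenshtein_ratio returns 100 * (1 - distance / length of the longer string); that definition is a guess, not something stated in the question, so check it against the actual function before using the 0.18 bound. MySQL is also free to choose the order in which it evaluates the conditions, so time the query to confirm it helps.

```sql
-- Assumption: levenshtein_ratio(s, t) = 100 * (1 - distance / GREATEST(len(s), len(t))).
-- Under that definition, ratio > 82 implies distance < 0.18 * (longer length),
-- and the distance is always >= the absolute difference in lengths.
SELECT
  *
FROM
  library a,
  classifications b
WHERE
  a.`release_year` = b.`year`
  AND a.`id` IS NULL
  -- Cheap pre-filter: discard pairs whose lengths differ too much to ever pass
  AND ABS(CHAR_LENGTH(a.title) - CHAR_LENGTH(b.title))
      < 0.18 * GREATEST(CHAR_LENGTH(a.title), CHAR_LENGTH(b.title))
  AND levenshtein_ratio(a.title, b.title) > 82;
```

If the function counts bytes rather than characters, LENGTH would be the matching choice instead of CHAR_LENGTH.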
There is too little information in the question to give a definitive answer.
1) My first guess would be to try to add other WHERE conditions that reduce the number of rows to be scanned.
2) If that is not possible: given that the titles in the library and classifications tables are known, one idea would be to create a table where all the ratios are already calculated, like this:
TABLE levenshtein_ratio
id_table_library
id_table_classifications
precalculated_levenshtein_ratio
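A minimal sketch of how that layout might be declared in MySQL follows. The column types are assumptions (integer ids and a ratio between 0 and 100), since the question does not show the schema of library or classifications.

```sql
-- Hypothetical definition; adjust the types to match library.id,
-- classifications.id and the return type of the levenshtein_ratio function.
CREATE TABLE levenshtein_ratio (
  id_table_library                INT NOT NULL,
  id_table_classifications        INT NOT NULL,
  precalculated_levenshtein_ratio DECIMAL(5,2) NOT NULL,
  -- One row per (library, classifications) pair; the primary key also
  -- serves as the index for the join on both id columns.
  PRIMARY KEY (id_table_library, id_table_classifications)
);
```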
so you would populate the table using this query:
insert into levenshtein_ratio select a.id, b.id, levenshtein_ratio(a.title, b.title) from library a, classifications b
and then your query would be:
SELECT
*
FROM
library a LEFT JOIN
classifications b ON a.`release_year` = b.`year`
LEFT JOIN levenshtein_ratio c ON c.id_table_library = a.id AND c.id_table_classifications = b.id
WHERE
a.`id` IS NULL
AND precalculated_levenshtein_ratio > 82
This query will then probably take no more than the original 2 seconds.
The problem with this solution is that the data in tables a and b can change, so you will need to create triggers to keep the precalculated table updated.
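As a rough illustration, a trigger for new rows in library could look like the sketch below. The trigger name is made up, and it assumes the levenshtein_ratio table and column names described earlier in this answer. Similar triggers would be needed for updates and deletes, and for changes to classifications.

```sql
-- Hypothetical trigger: when a row is added to library, precalculate its
-- ratio against every row in classifications.
CREATE TRIGGER library_after_insert
AFTER INSERT ON library
FOR EACH ROW
  INSERT INTO levenshtein_ratio
    (id_table_library, id_table_classifications, precalculated_levenshtein_ratio)
  SELECT NEW.id, b.id, levenshtein_ratio(NEW.title, b.title)
  FROM classifications b;
```

Since the final query only joins rows with matching years, adding WHERE b.`year` = NEW.release_year to the trigger's SELECT would keep each insert much cheaper, at the cost of storing only those pairs.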
Change your query to use proper joins (the explicit JOIN syntax has been part of the SQL standard since SQL-92).
Also, your levenshtein condition can be moved into the join condition, which should give you a performance benefit:
SELECT *
FROM library a
JOIN classifications b
ON a.`release_year` = b.`year`
AND levenshtein_ratio(a.title, b.title) > 82
WHERE a.`id` IS NULL
Also, make sure there's an index on b.year:
create index b_year on classifications(year);