In SQL databases (I use Python+Sqlite), how to make sure that, if we have 1 million rows, the query
SELECT * FROM mytable WHERE myfunction(description) < 500 AND column2 < 1000
[-----------------------------] [--------------]
high-CPU cost condition easy-to-test
requiring 100 µs per test condition
is optimized so that the 1st condition (CPU-expensive) is only tested if the easy-to-test second condition is already True? (since it's a logical AND
, is it a lazy AND
?)
Example:
if the 1st condition is always tested, it would require 1 million x 100 µs = 100 seconds!
if the 2nd condition is tested first, then only 5000 items would be pre-filtered (in my use-case), and then, applying the 1st condition would be very fast.
Note:
column2 is not necessary an ID, it could be something else
in my use case,
myfunction
involves Levenshtein distance computation
(Updated answer based on comments and subsequent testing.)
The actual answer to your question
depends on
A simple test should tell you whether your query would be sufficiently "optimized" for your needs. The good news is that SQLite will perform the easy (inexpensive) condition first, at least under certain circumstances.
For a test table "mytable"
containing a million rows
the Python test code
returns
and counting the lines in log_file.txt we see
indicating that our function was only called one thousand times, not one million times. SQLite has clearly applied the
column2 < 1000
first, and then applied themyfunction(description) < 500
condition on the subset of rows from the first condition.(Original "off the cuff" answer.)
The actual answer to your question depends on how clever the query optimizer is. A simple test should tell you whether your query would be sufficiently "optimized" for your needs.
However, you do have a couple of options if your tests find that your original approach is too slow:
Option 1: Try doing the simple comparison "first"
Changing the order might affect the query plan, e.g.
might turn out to be faster than
Option 2: Try forcing the order using a subquery
Again, depending on the cleverness of the query optimizer
might apply the inexpensive condition first, then apply the expensive condition on the resulting subset of rows. (However, a comment indicates that SQLite is too sophisticated to fall for that ploy.)
One way that you can force the order of execution is using a
case
expression. In general, SQL optimizers can re-arrange operations, the one exception iscase
.Generally,
case
expressions are discouraged inWHERE
clauses . . . one major reason is that they impede optimization. In this case, that is a good thing.SQLite will happily reorder AND-connected expressions whenever it feels like it. So while rewriting the query to check
column2
first appears to work in the current version, there is no guarantee.The query optimizer assumes that speed is determined mainly by disk I/O, so it estimates the cost of both conditions to be the same. Cost estimates are influenced by indexes, and by ANALYZE statistics (which work only for indexed data). So the easiest way to speed up this query (and probably most other queries you will use) is to create an index on
column2
:If you do not want to use an index for some reason, you have to use a construct that the query optimizer cannot optimize away. A CASE expression as shown in Gordon's answer would work just fine. In the general case, move the first condition into a subquery and prevent subquery flattening by breaking one of the listed rules; adding a dummy LIMIT clause to both queries usually is easiest:
Inspired by @GordThompson's answer, here is a benchmark between:
vs.
Test (1) (easy-to-test condition first): 1.02 seconds
Result: 1.02 seconds, it means that
myfunc
has been called max 1000 times, i.e. not for all the 10k rows.Test (2) (Slow-to-compute condition first): 10.05 seconds
Idem with:
instead.
Result: 10.05 seconds, it means that
myfunc
has been called ~ 10k times, i.e. for all the 10k rows, even those for which the conditioncol2 < 1000
is not True.Global conclusion: Sqlite does lazy evaluation for
AND
, i.e. the easy condition has to be written first like this: