Intro
I have the following SQLite table with 198,305 geocoded portuguese postal codes:
CREATE TABLE "pt_postal" (
"code" text NOT NULL,
"geo_latitude" real(9,6) NULL,
"geo_longitude" real(9,6) NULL
);
CREATE UNIQUE INDEX "pt_postal_code" ON "pt_postal" ("code");
CREATE INDEX "coordinates" ON "pt_postal" ("geo_latitude", "geo_longitude");
I also have the following user defined function in PHP that returns the distance between two coordinates:
$db->sqliteCreateFunction('geo', function ()
{
if (count($data = func_get_args()) < 4)
{
$data = explode(',', implode(',', $data));
}
if (count($data = array_map('deg2rad', array_filter($data, 'is_numeric'))) == 4)
{
return round(6378.14 * acos(sin($data[0]) * sin($data[2]) + cos($data[0]) * cos($data[2]) * cos($data[1] - $data[3])), 3);
}
return null;
});
Only 874 records have a distance from 38.73311, -9.138707
smaller or equal to 1 km.
The Problem
The UDF is working flawlessly in SQL queries, but for some reason I cannot use it's return value in WHERE
clauses - for instance, if I execute the query:
SELECT
"code",
geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
FROM "pt_postal" WHERE 1 = 1
AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
AND "distance" <= 1
ORDER BY "distance" ASC
LIMIT 2048;
It returns 1035 records ordered by distance
in ~0.05 seconds, however the last record has a "distance" of 1.353
km (which is bigger than the 1 km I defined as the maximum in the last WHERE
).
If I drop the following clauses:
AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
Now the query takes nearly 6 seconds and returns 2048 records (my LIMIT
) ordered by distance
. It's supposed take this long, but it should only return the 874 records that have "distance" <= 1
.
The EXPLAIN QUERY PLAN
for the original query returns:
SEARCH TABLE pt_postal USING INDEX coordinates (geo_latitude>? AND geo_latitude<?)
#(~7500 rows)
USE TEMP B-TREE FOR ORDER BY
And without the coordinate boundaries:
SCAN TABLE pt_postal
#(~500000 rows)
USE TEMP B-TREE FOR ORDER BY
What I Would Like to Do
I think I know why this is happening, SQLite is doing:
- use index
coordinates
to filter out the records outside of the boundaries in theWHERE
clauses - filter those records by the
"distance" <= 1
WHERE
clause, butdistance
is stillNULL => 0
! - populate "code" and "distance" (by calling the UDF for the first time)
- order by the "distance" (which is populated by now)
- limit the records
What I would like SQLite to do:
- use index
coordinates
to filter out the records outside of the boundaries in theWHERE
clauses - for those records, populate
code
anddistance
by calling the UDF - filter the records by the
"distance" <= 1
WHERE
clause - order by the "distance" (without calling the UDF again)
- limit the records
Can anyone explain how I can make SQLite behave (if it's even possible) the way I want it to?
Postscript
Just out of curiosity, I tried to benchmark how much slower calling the UDF twice would be:
SELECT
"code",
geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
FROM "pt_postal" WHERE 1 = 1
AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
AND geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") <= 1
ORDER BY "distance" ASC
LIMIT 2048;
To my surprise, it still runs in the same ~0.06 seconds - and it still (wrongly!) returns the 1035 records.
Seems like the second geo()
call is not even being evaluated... But it should, right?
I cannot tell from the documentation whether or not
sqliteCreateFunction
defines an aggregate, likeSUM
, or a scalar, likesqrt
. Aggregate functions cannot be referenced in aWHERE
clause;HAVING
is required.Per the SQLite UDF documentation, you need to know if only xFunc is populated, or if xStep and xFinal are. Those are the pointers SQLite uses to know the kind of function you're defining, and thus whether or not to honor it in a
WHERE
clause.This query (provided by @OMGPonies):
Correctly returns the 873 records, ordered by
distance
in ~0.07 seconds.However, I'm still wondering why SQLite doesn't evaluate
geo()
in theWHERE
clause, like MySQL...Basically, I was using
sprintf()
to see what kind of bounding coordinates where being computed, and since I couldn't run the query on any place other than PHP (because of the UDF) I was generating another query with prepared statements. The problem was, I wasn't generating the last bound parameter (the kilometers in thedistance <= ?
clause) and I was fooled by mysprintf()
version.Guess I shouldn't try to code when I'm sleepy. I'm truly sorry for your wasted time, and thank you all!
Just for the sake of completeness, the following returns (correctly!) 873 records, in ~ 0.04 seconds:
This also return 873 records, ordered by
distance
in ~0.04 seconds:The reason this page doesn't have a
GROUP BY
clause is MySQL specific:If no primary / unique key is available, the following hack also works (albeit a bit slower - ~0.16 seconds):