Intro
I have the following SQLite table with 198,305 geocoded portuguese postal codes:
CREATE TABLE "pt_postal" (
"code" text NOT NULL,
"geo_latitude" real(9,6) NULL,
"geo_longitude" real(9,6) NULL
);
CREATE UNIQUE INDEX "pt_postal_code" ON "pt_postal" ("code");
CREATE INDEX "coordinates" ON "pt_postal" ("geo_latitude", "geo_longitude");
I also have the following user defined function in PHP that returns the distance between two coordinates:
$db->sqliteCreateFunction('geo', function ()
{
if (count($data = func_get_args()) < 4)
{
$data = explode(',', implode(',', $data));
}
if (count($data = array_map('deg2rad', array_filter($data, 'is_numeric'))) == 4)
{
return round(6378.14 * acos(sin($data[0]) * sin($data[2]) + cos($data[0]) * cos($data[2]) * cos($data[1] - $data[3])), 3);
}
return null;
});
Only 874 records have a distance from 38.73311, -9.138707
smaller or equal to 1 km.
The Problem
The UDF is working flawlessly in SQL queries, but for some reason I cannot use it's return value in WHERE
clauses - for instance, if I execute the query:
SELECT
"code",
geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
FROM "pt_postal" WHERE 1 = 1
AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
AND "distance" <= 1
ORDER BY "distance" ASC
LIMIT 2048;
It returns 1035 records ordered by distance
in ~0.05 seconds, however the last record has a "distance" of 1.353
km (which is bigger than the 1 km I defined as the maximum in the last WHERE
).
If I drop the following clauses:
AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
Now the query takes nearly 6 seconds and returns 2048 records (my LIMIT
) ordered by distance
. It's supposed take this long, but it should only return the 874 records that have "distance" <= 1
.
The EXPLAIN QUERY PLAN
for the original query returns:
SEARCH TABLE pt_postal USING INDEX coordinates (geo_latitude>? AND geo_latitude<?)
#(~7500 rows)
USE TEMP B-TREE FOR ORDER BY
And without the coordinate boundaries:
SCAN TABLE pt_postal
#(~500000 rows)
USE TEMP B-TREE FOR ORDER BY
What I Would Like to Do
I think I know why this is happening, SQLite is doing:
- use index
coordinates
to filter out the records outside of the boundaries in theWHERE
clauses - filter those records by the
"distance" <= 1
WHERE
clause, butdistance
is stillNULL => 0
! - populate "code" and "distance" (by calling the UDF for the first time)
- order by the "distance" (which is populated by now)
- limit the records
What I would like SQLite to do:
- use index
coordinates
to filter out the records outside of the boundaries in theWHERE
clauses - for those records, populate
code
anddistance
by calling the UDF - filter the records by the
"distance" <= 1
WHERE
clause - order by the "distance" (without calling the UDF again)
- limit the records
Can anyone explain how I can make SQLite behave (if it's even possible) the way I want it to?
Postscript
Just out of curiosity, I tried to benchmark how much slower calling the UDF twice would be:
SELECT
"code",
geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
FROM "pt_postal" WHERE 1 = 1
AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
AND geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") <= 1
ORDER BY "distance" ASC
LIMIT 2048;
To my surprise, it still runs in the same ~0.06 seconds - and it still (wrongly!) returns the 1035 records.
Seems like the second geo()
call is not even being evaluated... But it should, right?