SQLite - WHERE Clause & UDFs

2019-03-06 00:06发布

问题:

Intro

I have the following SQLite table with 198,305 geocoded portuguese postal codes:

CREATE TABLE "pt_postal" (
  "code" text NOT NULL,
  "geo_latitude" real(9,6) NULL,
  "geo_longitude" real(9,6) NULL
);

CREATE UNIQUE INDEX "pt_postal_code" ON "pt_postal" ("code");
CREATE INDEX "coordinates" ON "pt_postal" ("geo_latitude", "geo_longitude");

I also have the following user defined function in PHP that returns the distance between two coordinates:

$db->sqliteCreateFunction('geo', function ()
{
    if (count($data = func_get_args()) < 4)
    {
        $data = explode(',', implode(',', $data));
    }

    if (count($data = array_map('deg2rad', array_filter($data, 'is_numeric'))) == 4)
    {
        return round(6378.14 * acos(sin($data[0]) * sin($data[2]) + cos($data[0]) * cos($data[2]) * cos($data[1] - $data[3])), 3);
    }

    return null;
});

Only 874 records have a distance from 38.73311, -9.138707 smaller or equal to 1 km.


The Problem

The UDF is working flawlessly in SQL queries, but for some reason I cannot use it's return value in WHERE clauses - for instance, if I execute the query:

SELECT
    "code",
    geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
    FROM "pt_postal" WHERE 1 = 1
        AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
        AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
        AND "distance" <= 1
    ORDER BY "distance" ASC
LIMIT 2048;

It returns 1035 records ordered by distance in ~0.05 seconds, however the last record has a "distance" of 1.353 km (which is bigger than the 1 km I defined as the maximum in the last WHERE).

If I drop the following clauses:

AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477

Now the query takes nearly 6 seconds and returns 2048 records (my LIMIT) ordered by distance. It's supposed take this long, but it should only return the 874 records that have "distance" <= 1.

The EXPLAIN QUERY PLAN for the original query returns:

SEARCH TABLE pt_postal USING INDEX coordinates (geo_latitude>? AND geo_latitude<?)
#(~7500 rows)
USE TEMP B-TREE FOR ORDER BY

And without the coordinate boundaries:

SCAN TABLE pt_postal
#(~500000 rows)
USE TEMP B-TREE FOR ORDER BY

What I Would Like to Do

I think I know why this is happening, SQLite is doing:

  1. use index coordinates to filter out the records outside of the boundaries in the WHERE clauses
  2. filter those records by the "distance" <= 1 WHERE clause, but distance is still NULL => 0!
  3. populate "code" and "distance" (by calling the UDF for the first time)
  4. order by the "distance" (which is populated by now)
  5. limit the records

What I would like SQLite to do:

  1. use index coordinates to filter out the records outside of the boundaries in the WHERE clauses
  2. for those records, populate code and distance by calling the UDF
  3. filter the records by the "distance" <= 1 WHERE clause
  4. order by the "distance" (without calling the UDF again)
  5. limit the records

Can anyone explain how I can make SQLite behave (if it's even possible) the way I want it to?


Postscript

Just out of curiosity, I tried to benchmark how much slower calling the UDF twice would be:

SELECT
    "code",
    geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
    FROM "pt_postal" WHERE 1 = 1
        AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
        AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
        AND geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") <= 1
    ORDER BY "distance" ASC
LIMIT 2048;

To my surprise, it still runs in the same ~0.06 seconds - and it still (wrongly!) returns the 1035 records.

Seems like the second geo() call is not even being evaluated... But it should, right?

回答1:

This also return 873 records, ordered by distance in ~0.04 seconds:

SELECT
    "code",
    geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
    FROM "pt_postal" WHERE 1 = 1
        AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
        AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
    GROUP BY "code"
        HAVING "distance" <= 1
    ORDER BY "distance" ASC
LIMIT 2048;

The reason this page doesn't have a GROUP BY clause is MySQL specific:

A HAVING clause can refer to any column or alias named in a select_expr in the SELECT list or in outer subqueries, and to aggregate functions. However, the SQL standard requires that HAVING must reference only columns in the GROUP BY clause or columns used in aggregate functions. To accommodate both standard SQL and the MySQL-specific behavior of being able to refer columns in the SELECT list, MySQL 5.0.2 and up permit HAVING to refer to columns in the SELECT list, columns in the GROUP BY clause, columns in outer subqueries, and to aggregate functions.


If no primary / unique key is available, the following hack also works (albeit a bit slower - ~0.16 seconds):

SELECT
    "code",
    geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
    FROM "pt_postal" WHERE 1 = 1
        AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
        AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
    GROUP BY _ROWID_
        HAVING "distance" <= 1
    ORDER BY "distance" ASC
LIMIT 2048;


回答2:

This query (provided by @OMGPonies):

SELECT *
    FROM (
        SELECT
            "code",
            geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
            FROM "pt_postal" WHERE 1 = 1
                AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
                AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
    )
        WHERE "distance" <= 1
    ORDER BY "distance" ASC
LIMIT 2048;

Correctly returns the 873 records, ordered by distance in ~0.07 seconds.

However, I'm still wondering why SQLite doesn't evaluate geo() in the WHERE clause, like MySQL...



回答3:

I cannot tell from the documentation whether or not sqliteCreateFunction defines an aggregate, like SUM, or a scalar, like sqrt. Aggregate functions cannot be referenced in a WHERE clause; HAVING is required.

Per the SQLite UDF documentation, you need to know if only xFunc is populated, or if xStep and xFinal are. Those are the pointers SQLite uses to know the kind of function you're defining, and thus whether or not to honor it in a WHERE clause.



回答4:

Basically, I was using sprintf() to see what kind of bounding coordinates where being computed, and since I couldn't run the query on any place other than PHP (because of the UDF) I was generating another query with prepared statements. The problem was, I wasn't generating the last bound parameter (the kilometers in the distance <= ? clause) and I was fooled by my sprintf() version.

Guess I shouldn't try to code when I'm sleepy. I'm truly sorry for your wasted time, and thank you all!


Just for the sake of completeness, the following returns (correctly!) 873 records, in ~ 0.04 seconds:

SELECT "code",
    geo(38.73311, -9.138707, "geo_latitude", "geo_longitude") AS "distance"
    FROM "pt_postal" WHERE 1 = 1
        AND "geo_latitude" BETWEEN 38.7241268076 AND 38.7420931924
        AND "geo_longitude" BETWEEN -9.15022289523 AND -9.12719110477
        AND "distance" <= 1
    ORDER BY "distance" ASC
LIMIT 2048;