LATERAL JOIN not using trigram index

2019-04-06 15:42发布

问题:

I want to do some basic geocoding of addresses using Postgres. I have an address table that has around 1 million raw address strings:

=> \d addresses
  Table "public.addresses"
 Column  | Type | Modifiers
---------+------+-----------
 address | text |

I also have a table of location data:

=> \d locations
   Table "public.locations"
   Column   | Type | Modifiers
------------+------+-----------
 id         | text |
 country    | text |
 postalcode | text |
 latitude   | text |
 longitude  | text |

Most of the address strings contain postalcodes, so my first attempt was to do a like and a lateral join:

EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
    SELECT * FROM locations
    WHERE address ilike '%' || postalcode || '%'
    ORDER BY LENGTH(postalcode) DESC
    LIMIT 1
) AS l ON true;

That gave the expected result, but it was slow. Here's the query plan:

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=18383.07..18540688323.77 rows=1008572 width=91)
   ->  Seq Scan on addresses a  (cost=0.00..20997.72 rows=1008572 width=56)
   ->  Limit  (cost=18383.07..18383.07 rows=1 width=35)
         ->  Sort  (cost=18383.07..18391.93 rows=3547 width=35)
               Sort Key: (length(locations.postalcode))
               ->  Seq Scan on locations  (cost=0.00..18365.33 rows=3547 width=35)
                     Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))

I tried adding a gist trigram index to the address column, like mentioned at https://stackoverflow.com/a/13452528/36191, but the query plan for the above query doesn't make use of it, and the query plan in unchanged.

CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);

I have to remove the order by and limit in the lateral join query for the index to get used, which doesn't give me the results I want. Here's the query plan for the query without ORDER or LIMIT:

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=39.35..129156073.06 rows=3577682241 width=86)
   ->  Seq Scan on locations  (cost=0.00..12498.55 rows=709455 width=28)
   ->  Bitmap Heap Scan on addresses a  (cost=39.35..131.60 rows=5043 width=58)
         Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
         ->  Bitmap Index Scan on idx_address  (cost=0.00..38.09 rows=5043 width=0)
               Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))

Is there something I can do to get the query to use the index, or is there a better way to rewrite this query?

回答1:

Why?

The query cannot use the index on principal. You would need an index on the table locations, but the one you have is on the table addresses.

You can verify my claim by setting:

SET enable_seqscan = off;

(In your session only, and for debugging only. Never use it in production.) It's not like the index would be more expensive than a sequential scan, there is just no way for Postgres to use it for your query at all.

Aside: [INNER] JOIN ... ON true is just an awkward way of saying CROSS JOIN ...

Why is the index used after removing ORDER and LIMIT?

Because Postgres can rewrite this simple form to:

SELECT *
FROM   addresses a
JOIN   locations l ON a.address ILIKE '%' || l.postalcode || '%';

You'll see the exact same query plan. (At least I do in my tests on Postgres 9.5.)

Solution

You need an index on locations.postalcode. And while using LIKE or ILIKE you would also need to bring the indexed expression (postalcode) to the left side of the operator. ILIKE is implemented with the operator ~~* and this operator has no COMMUTATOR (a logical necessity), so it's not possible to flip operands around. Detailed explanation in these related answers:

  • Can PostgreSQL index array columns?
  • PostgreSQL - text Array contains value similar to
  • Is there a way to usefully index a text column containing regex patterns?

A solution is to use the trigram similarity operator % or its inverse, the distance operator <-> in a nearest neighbour query instead (each is commutator for itself, so operands can switch places freely):

SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   ORDER  BY postalcode <-> a.address
   LIMIT  1
   ) l ON address ILIKE '%' || postalcode || '%';

Find the most similar postalcode for each address, and then check if that postalcode actually matches fully.

This way, a longer postalcode will be preferred automatically since it's more similar (smaller distance) than a shorter postalcode that also matches.

A bit of uncertainty remains. Depending on possible postal codes, there could be false positives due to matching trigrams in other parts of the string. There is not enough information in the question to say more.

Here, [INNER] JOIN instead of CROSS JOIN makes sense, since we add an actual join condition.

The manual:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

So:

CREATE INDEX locations_postalcode_trgm_gist_idx ON locations
USING gist (postalcode gist_trgm_ops);


回答2:

It's a far shot, but how does the following alternative perform?

SELECT DISTINCT ON ((x.a).address) (x.a).*, l.*
FROM (
  SELECT a, l.id AS lid, LENGTH(l.postalcode) AS pclen
  FROM addresses a
  LEFT JOIN locations l ON (a.address ilike '%' || l.postalcode || '%') -- this should be fast, but produce many rows
  ) x
LEFT JOIN locations l ON (l.id = x.lid)
ORDER BY (x.a).address, pclen DESC -- this is where it will be slow, as it'll have to sort the entire results, to filter them by DISTINCT ON


回答3:

It can work if you turn the lateral join inside out. But even then it might still be really slow

SELECT DISTINCT ON (address) *
FROM (
    SELECT * 
    FROM locations
       ,LATERAL(
           SELECT * FROM addresses
           WHERE address ilike '%' || postalcode || '%'
           OFFSET 0 -- force fencing, might be redundant
        ) a
) q
ORDER BY address, LENGTH(postalcode) DESC

The downside is that you can implement paging only on postalcodes, not addresses.