PostgreSQL Joining Between Two Values

2020-02-15 07:07发布

问题:

I have the following tables and am trying to look up county codes for a list of several hundred thousand cities.

create table counties (
  zip_code_from  char(5) not null,
  zip_code_thru  char(5) not null,
  county_code    char(3) not null
);

create table cities (
  city      text    not null,
  zip_code  char(5) not null
);

My first approach was using a "between" in the join:

select
  ci.city, ci.zip_code, co.county_code
from
  cities ci
  join counties co on
    co.zip_code between ci.zip_code_from and ci.zip_code_thru

I know in the Oracle world, this was frowned upon, and indeed the performance appears to be miserable. It takes over 8 minutes to process around 16,000 cities. The zip code table has around 80,000 records. I'm guessing that this syntax is a glorified cross-join?

Both the from and thru codes are indexed, and I have control over the structures, so I can change the table if it helps.

My only other thought is to go ahead and expand the table out to all possible values -- something similar to this:

select
  generate_series (
    cast (zip_code_from as int),
    cast (zip_code_thru as int)
  ) as zip_code,
  *
from counties

This would expand the data to over 200,000 records, which isn't a big deal, but I wasn't sure if this is my only recourse to have queries that aren't horrible.

I'm guessing that even doing that on the fly and not having indexes would be preferable to the between in my join, but I was hoping there is an alternative, either in terms of my SQL and/or something I can do with the structure of the table itself.

I've seen this question posted for other DBMS platforms, but I've been able to pull off mini-miracles with PostgreSQL that weren't possible (or practical) in other databases, so I was hopeful there is something I've missed.

回答1:

Months later, this has cropped its head again, and I decided to test some of my theories.

The original query:

select
  ci.city, ci.zip_code, co.fips_code
from
  cities ci
  join counties co on
    ci.zip_code between co.from_zip_code and co.thru_zip_code

Does in fact implement a cartesian. The query returns 34,000 rows and takes 597 seconds.

If I "pre-explode" the zip code ranges into discrete records:

with exploded_zip as (
  select
    generate_series (
      cast (from_zip_code as int),
      cast (thru_zip_code as int)
    )::text as zip_code,
    *
  from counties
)
select
  ci.city, ci.zip_code, co.fips_code
from
  cities ci
  join exploded_zip co on
    ci.zip_code = co.zip_code

The query returns the exact same rows but finishes in 2.8 seconds.

So it seems the bottom line is that using a between in a join (or any inequality) is a really bad idea.