Slow query on “UNION ALL” view

2019-03-25 00:18发布

问题:

I have a DB view which basically consists of two SELECT queries with UNION ALL, like this:

CREATE VIEW v AS
SELECT time, etc. FROM t1 // #1...
UNION ALL
SELECT time, etc. FROM t2 // #2...

The problem is that selects of the form

SELECT ... FROM v WHERE time >= ... AND time < ...

perform really really slow on it.

Both SELECT #1 and #2 are decently fast, properly indexed and so on: when I create views v1 and v2 like:

CREATE VIEW v1 AS
SELECT time, etc. FROM t1 // #1...

CREATE VIEW v2 AS
SELECT time, etc. FROM t2 // #2...

And the same SELECT, with same WHERE condition as the above works OK on them individually.

Any ideas about where might be the problem and how to solve it?

(Just to mention, it's one of the recent Postgres versions.)

Edit: Adding anonymized query plans (thaks to @filiprem for the link to an awesome tool):

v1:

Aggregate  (cost=9825.510..9825.520 rows=1 width=53) (actual time=59.995..59.995 rows=1 loops=1)
  ->  Index Scan using delta on echo alpha  (cost=0.000..9815.880 rows=3850 width=53) (actual time=0.039..53.418 rows=33122 loops=1)
          Index Cond: (("juliet" >= 'seven'::uniform bravo_victor oscar whiskey) AND ("juliet" <= 'november'::uniform bravo_victor oscar whiskey))
          Filter: ((NOT victor) AND ((bravo_sierra five NULL) OR ((bravo_sierra)::golf <> 'india'::golf)))

v2:

Aggregate  (cost=15.470..15.480 rows=1 width=33) (actual time=0.231..0.231 rows=1 loops=1)
  ->  Index Scan using yankee on six charlie  (cost=0.000..15.220 rows=99 width=33) (actual time=0.035..0.186 rows=140 loops=1)
          Index Cond: (("juliet" >= 'seven'::uniform bravo oscar whiskey) AND ("juliet" <= 'november'::uniform bravo oscar whiskey))
          Filter: (NOT victor)

v:

Aggregate  (cost=47181.850..47181.860 rows=1 width=0) (actual time=37317.291..37317.291 rows=1 loops=1)
  ->  Append  (cost=42.170..47132.480 rows=3949 width=97) (actual time=1.277..37304.453 rows=33262 loops=1)
        ->  Nested Loop Left Join  (cost=42.170..47052.250 rows=3850 width=99) (actual time=1.275..37288.465 rows=33122 loops=1)
              ->  Hash Left Join  (cost=42.170..9910.990 rows=3850 width=115) (actual time=1.123..117.797 rows=33122 loops=1)
                      Hash Cond: ((alpha_seven.two)::golf = (quebec_three.two)::golf)
                    ->  Index Scan using delta on echo alpha_seven  (cost=0.000..9815.880 rows=3850 width=132) (actual time=0.038..77.866 rows=33122 loops=1)
                            Index Cond: (("juliet" >= 'seven'::uniform bravo_victor oscar whiskey_two) AND ("juliet" <= 'november'::uniform bravo_victor oscar whiskey_two))
                            Filter: ((NOT victor) AND ((bravo_sierra five NULL) OR ((bravo_sierra)::golf <> 'india'::golf)))
                    ->  Hash  (cost=30.410..30.410 rows=941 width=49) (actual time=1.068..1.068 rows=941 loops=1)
                            Buckets: 1024  Batches: 1  Memory Usage: 75kB
                          ->  Seq Scan on alpha_india quebec_three  (cost=0.000..30.410 rows=941 width=49) (actual time=0.010..0.486 rows=941 loops=1)
              ->  Index Scan using mike on hotel quebec_sierra  (cost=0.000..9.630 rows=1 width=24) (actual time=1.112..1.119 rows=1 loops=33122)
                      Index Cond: ((alpha_seven.zulu)::golf = (quebec_sierra.zulu)::golf)
        ->  Subquery Scan on "*SELECT* 2"  (cost=34.080..41.730 rows=99 width=38) (actual time=1.081..1.951 rows=140 loops=1)
              ->  Merge Right Join  (cost=34.080..40.740 rows=99 width=38) (actual time=1.080..1.872 rows=140 loops=1)
                      Merge Cond: ((quebec_three.two)::golf = (charlie.two)::golf)
                    ->  Index Scan using whiskey_golf on alpha_india quebec_three  (cost=0.000..174.220 rows=941 width=49) (actual time=0.017..0.122 rows=105 loops=1)
                    ->  Sort  (cost=18.500..18.750 rows=99 width=55) (actual time=0.915..0.952 rows=140 loops=1)
                            Sort Key: charlie.two
                            Sort Method:  quicksort  Memory: 44kB
                          ->  Index Scan using yankee on six charlie  (cost=0.000..15.220 rows=99 width=55) (actual time=0.022..0.175 rows=140 loops=1)
                                  Index Cond: (("juliet" >= 'seven'::uniform bravo_victor oscar whiskey_two) AND ("juliet" <= 'november'::uniform bravo_victor oscar whiskey_two))
                                  Filter: (NOT victor)

juliet is time.

回答1:

This seems to be a case of a pilot error. The "v" query plan selects from at least 5 different tables.

Now, Are You sure You are connected to the right database? Maybe there are some funky search_path settings? Maybe t1 and t2 are actually views (possibly in a different schema)? Maybe You are somehow selecting from the wrong view?

Edited after clarification:

You are using a quite new feature called "join removal" : http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.0#Join_Removal

http://rhaas.blogspot.com/2010/06/why-join-removal-is-cool.html

It appears that the feature does not kick in when union all is involved. You probably have to rewrite the view using only the required two tables.

another edit: You appear to be using an aggregate (like "select count(*) from v" vs. "select * from v"), which could get vastly different plans in face of join removal. I guess we won't get very far without You posting the actual queries, view and table definitions and plans used...



回答2:

I believe your query is being executed similar to:

(
   ( SELECT time, etc. FROM t1 // #1... )
   UNION ALL
   ( SELECT time, etc. FROM t2 // #2... )
)
WHERE time >= ... AND time < ...

which the optimizer is having difficulty optimizing. i.e. it's doing the UNION ALL first before applying the WHERE clause but, you wish it to apply the WHERE clause before the UNION ALL.

Couldn't you put your WHERE clause in the CREATE VIEW?

CREATE VIEW v AS
( SELECT time, etc. FROM t1  WHERE time >= ... AND time < ... )
UNION ALL
( SELECT time, etc. FROM t2  WHERE time >= ... AND time < ... )

Alternatively if the view cannot have the WHERE clause, then, perhaps you can keep to the two views and do the UNION ALL with the WHERE clause when you need them:

CREATE VIEW v1 AS
SELECT time, etc. FROM t1 // #1...

CREATE VIEW v2 AS
SELECT time, etc. FROM t2 // #2...

( SELECT * FROM v1 WHERE time >= ... AND time < ... )
UNION ALL
( SELECT * FROM v2 WHERE time >= ... AND time < ... )


回答3:

I do not know Postgres, but some RMDBs handle comparison operators worse than BETWEEN in case of indexes. I would make an attempt using BETWEEN.

SELECT ... FROM v WHERE time BETWEEN ... AND ...


回答4:

A possibility would be to issue a new SQL dynamically at each call instead of creating a view and to integrate the where clause in each SELECT of the union query

SELECT time, etc. FROM t1
    WHERE time >= ... AND time < ...
UNION ALL
SELECT time, etc. FROM t2
    WHERE time >= ... AND time < ...

EDIT:

Can you use a parametrized function?

CREATE OR REPLACE FUNCTION CallMyView(t1 date, t2 date)
RETURNS TABLE(d date, etc.)
AS $$
    BEGIN
        RETURN QUERY
            SELECT time, etc. FROM t1
                WHERE time >= t1 AND time < t2
            UNION ALL
            SELECT time, etc. FROM t2
                WHERE time >= t1 AND time < t2;
    END;
$$ LANGUAGE plpgsql;

Call

SELECT * FROM CallMyView(..., ...);


回答5:

Combine the two tables. Add a column to indicate original table. If necessary, replace the original table names with views that select just the relevant part. Problem solved!

Looking into the superclass/subclass db design pattern could be of use to you.



回答6:

Try creating your view using UNION DISTINCT instead of UNION ALL. See if it gives wrong results. See if it gives faster performance.

If it gives wrong results, try and map your SQL operations on tables back to relational operations on relations. The elements of relations are always distinct. There may be somthing fundamentally wrong with your model.

I am deeply suspicious of the LEFT JOINS in the query plan you showed. It shouldn't be necessary to perform LEFT JOINS in order to get the results you appear to be selecting.



回答7:

Encountered same scenario on 11g:

Scenario 1:

CREATE VIEW v AS
  SELECT time, etc. FROM t1 // #1...

The following query runs fast, plan looks okay:

SELECT ... FROM v WHERE time >= ... AND time < ...

Scenario 2:

CREATE VIEW v AS
  SELECT time, etc. FROM t2 // #2...

The following query runs fast, plan looks okay:

SELECT ... FROM v WHERE time >= ... AND time < ...

Scenario 3, with UNION ALL:

CREATE VIEW v AS
  SELECT time, etc. FROM t1 // #1...
  UNION ALL
  SELECT time, etc. FROM t2 // #2...

The following runs slow. Plan breaks apart t1 and t2 (which were also views) and assembles them as a big series of unions. The time filters are being applied properly on the individual components, but it is still very slow:

SELECT ... FROM v WHERE time >= ... AND time < ...

I would have been happy to just get a time in the ballpark of t1 plus t2, but it was more than double. Adding the parallel hint did the trick for me in this case. It re-arranged everything into a better plan:

SELECT /*+ parallel */ ... FROM v WHERE time >= ... AND time < ...


回答8:

I think i don't have much points to post it as comments so i am posting it as an answer

I don't know how PostgreSQL works behind the scene, i think you may get a clue if it would have been Oracle, so it is here how Oracle would work

Your UNION ALL view is slower because, behind the scene, records from both SELECT #1 and #2 are combined in a temporary table first, which is created on the fly, and then your SELECT ... FROM v WHERE time >= ... AND time < ... is executed on this temporary table. Since both #1 and #2 are indexed so they are working faster individually as expected, but this temporary table is not indexed (of course) and the final records are being selected from this temporary table so resulting in a slower response.

Now, at least, i don't see any way to have it faster + view + non-materialized

One way, other than running SELECT #1 and #2 and UNION them explicitly, to make it faster would be to use a stored procedure or a function in your application programming language (if it is the case), and in this procedure you make separate calls to each indexed table and then combine results, which is not as simple as SELECT ... FROM v WHERE time >= ... AND time < ... :(