Doing a big update using multiple tables in Postgr

2019-08-14 08:28发布

问题:

A few weeks ago I started working with Postgres and I've been trying to solve this problem for the last few days with mixed results: I have a table (10 million rows) that needs to be updated once every month with information stored in a few other tables. During this update no-one has access to the database, so no-one is reading / writing data. I will be the only DB user during that time.

Table A has 10 million rows. One column (rebuilding costs) needs to be updated. Table B, table C and table D have newly calculated rebuilding costs for every row in Table A (so table B, C and D each have 10 million rows). Which table is used for updating Table A depends on the value, see below. I've been trying to do this update with this query:

UPDATE table_A a
SET rebuilding_costs = 
    CASE
    WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs 
    WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs 
    WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs 
    ELSE NULL
    END
FROM table_B b
LEFT OUTER JOIN table_C c
    ON (b.address = c.address)
LEFT OUTER JOIN table_D d
    ON (b.address = d.address)
WHERE a.address = b.address
;

This query gives the right result, but it is a little slow (25 minutes). The funny thing is: when table A has 1 million rows (instead of 10 million) it only takes 30 seconds. So when running this query on a table with 10 million rows I expected the query to run for 5 minutes, but it took 25 minutes instead. Thats when I tried to UPDATE in Blocks, so I added this line to the query in the WHERE clause:

AND (a.id > 0 AND a.id < 1000000)

a.id is the primary key of table A. In this example only the first million rows of table A get updated. However, it took 3 minutes to complete. You would have to do it 10 times to update all 10 million rows so that would be 30 minutes..

I also tried to prevent the query from doing someone with rows that are not going to be changed by the UPDATE by adding this:

AND a.herbouwwaarde_indicatie IS DISTINCT FROM b.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM c.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM d.inhoud

Setting the fillfactor to 70 and 50 did help to improve the speed a little bit, but I couldnt get it under 20 minutes.

I also tried to recreate table A, like:

CREATE TABLE table_A_new

AS

SELECT a.address, 

    CASE
    WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs 
    WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs 
    WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs 
    ELSE NULL
    END rebuildingcosts

FROM table_A a
LEFT OUTER JOIN table_B b
    ON (a.address = b.address)
LEFT OUTER JOIN table_C c
    ON (a.address = c.address)
LEFT OUTER JOIN table_D d
    ON (a.address = d.address)
;

This is very fast (2 minutes) and gives the right result. However, you have to recreate an entire table. Somehow that doesnt seem very efficient (and takes a lot of storage space), that's why I started out with UPDATE query's.

My question is: what is the best way to proceed? Is there a way to improve the performance of the UPDATE query, or would it be best to look for an alternative such as the 'create table' one in the example above.

I have to use Postrgres though, I cant switch to a different DBMS.

This is the execution plan of the UPDATE query (right now the tables don't have 10 million rows, but approximately 6 - 8 million rows):

Update on tabel_A a  (cost=902288.27..2150690.80 rows=6714762 width=65)
->  Hash Join  (cost=902288.27..2150690.80 rows=6714762 width=65)
      Hash Cond: ((b.adres)::text = a.adres)"
Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs) 
AND (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
AND (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
    ->  Hash Left Join  (cost=522527.27..1318059.42 rows=6716471 width=39)
          Hash Cond: ((b.adres)::text = (d.adres)::text)
          ->  Hash Right Join  (cost=295916.60..817658.93 rows=6716471 width=29)
                Hash Cond: ((c.adres)::text = (b.adres)::text)
                ->  Seq Scan on Tabel_C c (cost=0.00..240642.35 rows=7600735 width=19)
                ->  Hash  (cost=172605.71..172605.71 rows=6716471 width=19)
                      ->  Seq Scan on tabel_B b  (cost=0.00..172605.71 rows=6716471 width=19)
          ->  Hash  (cost=103436.52..103436.52 rows=6709052 width=19)"
                ->  Seq Scan on tabel D d  (cost=0.00..103436.52 rows=6709052 width=19)"
     ->  Hash  (cost=217261.00..217261.00 rows=8000000 width=39)"
          ->  Seq Scan on Tabel_A a (cost=0.00..217261.00 rows=8000000 width=39)"

This is the Explain analyse output:

Update on Tabel_A a  (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=2815452.997..2815452.997 rows=0 loops=1)
  ->  Hash Join  (cost=902288.27..2150690.80 rows=6714762 width=65) (actual      time=108861.999..214888.780 rows=5252864 loops=1)
        Hash Cond: ((b.adres)::text = a.adres)
        Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs) AND  
  (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
  (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
         Rows Removed by Join Filter: 670998
        ->  Hash Left Join  (cost=522527.27..1318059.42 rows=6716471 width=39) (actual time=43138.635..116933.803 rows=6711432 loops=1)"
              Hash Cond: ((b.adres)::text = (d.adres)::text)"
              ->  Hash Right Join  (cost=295916.60..817658.93 rows=6716471 width=29) (actual time=34571.750..99040.256 rows=6710550 loops=1)"
                    Hash Cond: ((c.adres)::text = (b.adres)::text)"
                    ->  Seq Scan on Tabel_C c  (cost=0.00..240642.35 rows=7600735 width=19) (actual time=127.080..59703.935 rows=7595083 loops=1)"
                    ->  Hash  (cost=172605.71..172605.71 rows=6716471 width=19) (actual time=29925.787..29925.787 rows=6709229 loops=1)"
                          Buckets: 2048  Batches: 512  Memory Usage: 678kB"
                          ->  Seq Scan on Tabel_B b  (cost=0.00..172605.71 rows=6716471 width=19) (actual time=0.017..27245.069 rows=6709229 loops=1)"
              ->  Hash  (cost=103436.52..103436.52 rows=6709052 width=19) (actual     time=8566.848..8566.848 rows=6709229 loops=1)"
                    Buckets: 2048  Batches: 512  Memory Usage: 678kB"
                    ->  Seq Scan on Tabel_D d (cost=0.00..103436.52 rows=6709052 width=19) (actual time=0.009..5970.010 rows=6709229 loops=1)"
        ->  Hash  (cost=217261.00..217261.00 rows=8000000 width=39) (actual      time=65721.815..65721.815 rows=8000000 loops=1)"
               Buckets: 2048  Batches: 1024  Memory Usage: 612kB"
              ->  Seq Scan on Tabel_A a  (cost=0.00..217261.00 rows=8000000 width=39) (actual time=0.056..55968.171 rows=8000000 loops=1)"
Total runtime: 2815453.549 ms"

Table A, B, C and D have all indexes on the address column, the column that is being used to join on. Te tables have no other indexes. Table A does have a primary key (id). Table B, C and D have between 5 and 7 columns extra that are not being used in this process.

回答1:

This query should produce the same results, but avoids the CASE

-- EXPLAIN ANALYZE
UPDATE table_a a
SET rebuilding_costs = drie.rebuilding_costs
FROM (
        SELECT COALESCE(b.address, c.address, d.address) AS address
        , COALESCE(b.rebuilding_costs, c.rebuilding_costs,  d.rebuilding_costs)
            AS rebuilding_costs
        FROM table_b b
        FULL OUTER JOIN table_c c ON c.address = b.address
                AND  c.rebuilding_costs BETWEEN 200001 AND 400000
        FULL OUTER JOIN table_D d ON d.address = b.address
                AND  d.rebuilding_costs BETWEEN 400001 AND 600000
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        ) drie
WHERE a.address = drie.address
AND a.rebuilding_costs <> drie.rebuilding_costs -- Avoid useless updates
        ;

UPDATE: A similar approach, based on a chained set of CTEs:

-- --------------------------------
EXPLAIN ANALYZE
WITH cte_b AS (
        SELECT b.address,  b.rebuilding_costs
        FROM table_b b
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        )
,       cte_c AS (
        SELECT c.address , c.rebuilding_costs
        FROM table_c c
        WHERE  c.rebuilding_costs BETWEEN 200001 AND 400000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address)
        )
,       cte_d AS (
        SELECT d.address , d.rebuilding_costs
        FROM table_d d
        WHERE  d.rebuilding_costs BETWEEN 400001 AND 600000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address)
        AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address)
        )
,       cte_bcd AS (
        SELECT           cte_b.address,  cte_b.rebuilding_costs FROM cte_b
        UNION ALL SELECT cte_c.address,  cte_c.rebuilding_costs FROM cte_c
        UNION ALL SELECT cte_d.address,  cte_d.rebuilding_costs FROM cte_d
        )
UPDATE table_a a
SET rebuilding_costs = cte_bcd.rebuilding_costs
FROM cte_bcd
WHERE a.address = cte_bcd.address
-- avoid useless updates this way:
AND a.rebuilding_costs <> cte_bcd.rebuilding_costs
-- ,or this way:
-- AND cte_bcd.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs
   ;

UPDATE2: CTE's can be slow, because they act as optimisation barriers. A quick&dirty way is to rewrite tem as (temporary) VIEWs, and refer to these instead. This allows the optimiser to shuffle the various parts of the query into and out off subqueries, and even combine and reuse these.

CREATE TEMP VIEW cte_b AS (
        SELECT b.address,  b.rebuilding_costs
        FROM table_b b
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        );
CREATE TEMP VIEW        cte_c AS (
        SELECT c.address , c.rebuilding_costs
        FROM table_c c
        WHERE  c.rebuilding_costs BETWEEN 200001 AND 400000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address)
        );
CREATE TEMP VIEW        cte_d AS (
        SELECT d.address , d.rebuilding_costs
        FROM table_d d
        WHERE  d.rebuilding_costs BETWEEN 400001 AND 600000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address)
        AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address)
        );
CREATE TEMP VIEW        cte_bcd AS (
        SELECT           cte_b.address,  cte_b.rebuilding_costs FROM cte_b
        UNION ALL SELECT cte_c.address,  cte_c.rebuilding_costs FROM cte_c
        UNION ALL SELECT cte_d.address,  cte_d.rebuilding_costs FROM cte_d
        );
EXPLAIN -- ANALYZE
UPDATE table_a a
SET rebuilding_costs = cte_bcd.rebuilding_costs
FROM cte_bcd
WHERE a.address = cte_bcd.address
AND a.rebuilding_costs <> cte_bcd.rebuilding_costs -- avoid useless updates
-- AND a.address < 100000
        ;