A few weeks ago I started working with Postgres and I've been trying to solve this problem for the last few days with mixed results: I have a table (10 million rows) that needs to be updated once every month with information stored in a few other tables. During this update no-one has access to the database, so no-one is reading / writing data. I will be the only DB user during that time.
Table A has 10 million rows. One column (rebuilding costs) needs to be updated. Table B, table C and table D have newly calculated rebuilding costs for every row in Table A (so table B, C and D each have 10 million rows). Which table is used for updating Table A depends on the value, see below. I've been trying to do this update with this query:
UPDATE table_A a
SET rebuilding_costs =
CASE
WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs
WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs
WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs
ELSE NULL
END
FROM table_B b
LEFT OUTER JOIN table_C c
ON (b.address = c.address)
LEFT OUTER JOIN table_D d
ON (b.address = d.address)
WHERE a.address = b.address
;
This query gives the right result, but it is a little slow (25 minutes). The funny thing is: when table A has 1 million rows (instead of 10 million) it only takes 30 seconds. So when running this query on a table with 10 million rows I expected the query to run for 5 minutes, but it took 25 minutes instead. Thats when I tried to UPDATE in Blocks, so I added this line to the query in the WHERE clause:
AND (a.id > 0 AND a.id < 1000000)
a.id is the primary key of table A. In this example only the first million rows of table A get updated. However, it took 3 minutes to complete. You would have to do it 10 times to update all 10 million rows so that would be 30 minutes..
I also tried to prevent the query from doing someone with rows that are not going to be changed by the UPDATE by adding this:
AND a.herbouwwaarde_indicatie IS DISTINCT FROM b.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM c.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM d.inhoud
Setting the fillfactor to 70 and 50 did help to improve the speed a little bit, but I couldnt get it under 20 minutes.
I also tried to recreate table A, like:
CREATE TABLE table_A_new
AS
SELECT a.address,
CASE
WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs
WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs
WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs
ELSE NULL
END rebuildingcosts
FROM table_A a
LEFT OUTER JOIN table_B b
ON (a.address = b.address)
LEFT OUTER JOIN table_C c
ON (a.address = c.address)
LEFT OUTER JOIN table_D d
ON (a.address = d.address)
;
This is very fast (2 minutes) and gives the right result. However, you have to recreate an entire table. Somehow that doesnt seem very efficient (and takes a lot of storage space), that's why I started out with UPDATE query's.
My question is: what is the best way to proceed? Is there a way to improve the performance of the UPDATE query, or would it be best to look for an alternative such as the 'create table' one in the example above.
I have to use Postrgres though, I cant switch to a different DBMS.
This is the execution plan of the UPDATE query (right now the tables don't have 10 million rows, but approximately 6 - 8 million rows):
Update on tabel_A a (cost=902288.27..2150690.80 rows=6714762 width=65)
-> Hash Join (cost=902288.27..2150690.80 rows=6714762 width=65)
Hash Cond: ((b.adres)::text = a.adres)"
Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
AND (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
AND (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
-> Hash Left Join (cost=522527.27..1318059.42 rows=6716471 width=39)
Hash Cond: ((b.adres)::text = (d.adres)::text)
-> Hash Right Join (cost=295916.60..817658.93 rows=6716471 width=29)
Hash Cond: ((c.adres)::text = (b.adres)::text)
-> Seq Scan on Tabel_C c (cost=0.00..240642.35 rows=7600735 width=19)
-> Hash (cost=172605.71..172605.71 rows=6716471 width=19)
-> Seq Scan on tabel_B b (cost=0.00..172605.71 rows=6716471 width=19)
-> Hash (cost=103436.52..103436.52 rows=6709052 width=19)"
-> Seq Scan on tabel D d (cost=0.00..103436.52 rows=6709052 width=19)"
-> Hash (cost=217261.00..217261.00 rows=8000000 width=39)"
-> Seq Scan on Tabel_A a (cost=0.00..217261.00 rows=8000000 width=39)"
This is the Explain analyse output:
Update on Tabel_A a (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=2815452.997..2815452.997 rows=0 loops=1)
-> Hash Join (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=108861.999..214888.780 rows=5252864 loops=1)
Hash Cond: ((b.adres)::text = a.adres)
Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs) AND
(c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
(d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
Rows Removed by Join Filter: 670998
-> Hash Left Join (cost=522527.27..1318059.42 rows=6716471 width=39) (actual time=43138.635..116933.803 rows=6711432 loops=1)"
Hash Cond: ((b.adres)::text = (d.adres)::text)"
-> Hash Right Join (cost=295916.60..817658.93 rows=6716471 width=29) (actual time=34571.750..99040.256 rows=6710550 loops=1)"
Hash Cond: ((c.adres)::text = (b.adres)::text)"
-> Seq Scan on Tabel_C c (cost=0.00..240642.35 rows=7600735 width=19) (actual time=127.080..59703.935 rows=7595083 loops=1)"
-> Hash (cost=172605.71..172605.71 rows=6716471 width=19) (actual time=29925.787..29925.787 rows=6709229 loops=1)"
Buckets: 2048 Batches: 512 Memory Usage: 678kB"
-> Seq Scan on Tabel_B b (cost=0.00..172605.71 rows=6716471 width=19) (actual time=0.017..27245.069 rows=6709229 loops=1)"
-> Hash (cost=103436.52..103436.52 rows=6709052 width=19) (actual time=8566.848..8566.848 rows=6709229 loops=1)"
Buckets: 2048 Batches: 512 Memory Usage: 678kB"
-> Seq Scan on Tabel_D d (cost=0.00..103436.52 rows=6709052 width=19) (actual time=0.009..5970.010 rows=6709229 loops=1)"
-> Hash (cost=217261.00..217261.00 rows=8000000 width=39) (actual time=65721.815..65721.815 rows=8000000 loops=1)"
Buckets: 2048 Batches: 1024 Memory Usage: 612kB"
-> Seq Scan on Tabel_A a (cost=0.00..217261.00 rows=8000000 width=39) (actual time=0.056..55968.171 rows=8000000 loops=1)"
Total runtime: 2815453.549 ms"
Table A, B, C and D have all indexes on the address column, the column that is being used to join on. Te tables have no other indexes. Table A does have a primary key (id). Table B, C and D have between 5 and 7 columns extra that are not being used in this process.