Redshift UPDATE prohibitively slow

Posted 2019-04-09 02:28

Question:

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:

SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
  AND col2 = 12;

The above query returns in less than a second, because I have sortkeys on col1 and col2. Only one row meets these criteria, so the result set is a single row. However, if I run:

UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
  AND col2 = 12;

This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.

I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.
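For reference, what I tried was roughly the pattern from that page (the values and the staging-table name here are just illustrative):

BEGIN TRANSACTION;

CREATE TEMP TABLE stage (LIKE SOMETABLE);

-- load only the rows that need new values into the staging table
INSERT INTO stage (col1, col2, col3)
VALUES ('a value of col1', 12, 20);

UPDATE SOMETABLE
SET col3 = stage.col3
FROM stage
WHERE SOMETABLE.col1 = stage.col1
  AND SOMETABLE.col2 = stage.col2;

DROP TABLE stage;

END TRANSACTION;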

Any idea what is going on here?

Answer 1:

You didn't mention what percentage of the table you're updating, but it's important to note that an UPDATE in Redshift is a two-step process:

  1. Each row that will be changed must first be marked for deletion
  2. Then a new version of the data must be written for each column in the table

If you have a large number of columns and/or are updating a large number of rows, this process can be very expensive for the database.

You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table and then dropping the existing table and renaming the new table. This has the added benefit of leaving you with a fully sorted table.
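A minimal sketch of what that could look like, assuming the filter from the question and that col1/col2 are also reasonable distribution and sort keys (adjust DISTKEY/SORTKEY to match your actual table definition):

BEGIN;

-- build a new, fully sorted copy of the table, applying the "update" while copying
CREATE TABLE sometable_new
DISTKEY (col1)
SORTKEY (col1, col2)
AS
SELECT col1,
       col2,
       CASE WHEN col1 = 'a value of col1' AND col2 = 12
            THEN 20
            ELSE col3
       END AS col3
FROM SOMETABLE;

-- swap the new table in for the old one
DROP TABLE SOMETABLE;
ALTER TABLE sometable_new RENAME TO SOMETABLE;

COMMIT;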



Answer 2:

Redshift isn't really designed for bulk updates: it is built for OLAP rather than OLTP workloads, so update operations are inherently inefficient.

For this use case, I would suggest doing an INSERT instead of an UPDATE and adding a TIMESTAMP column. When you run analysis on Redshift, you then need extra logic to pick the row with the latest TIMESTAMP so that the superseded (duplicate) entries are filtered out.
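A minimal sketch of that insert-only pattern, assuming an added updated_at timestamp column (all names are illustrative):

-- append a new version of the row instead of updating it in place
INSERT INTO SOMETABLE (col1, col2, col3, updated_at)
VALUES ('a value of col1', 12, 20, GETDATE());

-- at query time, keep only the latest version of each (col1, col2) key
SELECT col1, col2, col3
FROM (
    SELECT col1, col2, col3,
           ROW_NUMBER() OVER (PARTITION BY col1, col2
                              ORDER BY updated_at DESC) AS rn
    FROM SOMETABLE
) latest
WHERE rn = 1;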