Why is an implicit table lock being released prior

2019-05-07 00:46发布

问题:

I have an ETL process that is building dimension tables incrementally in RedShift. It performs actions in the following order:

  1. Begins transaction
  2. Creates a table staging_foo like foo
  3. Copies data from external source into staging_foo
  4. Performs mass insert/update/delete on foo so that it matches staging_foo
  5. Drop staging_foo
  6. Commit transaction

Individually this process works, but in order to achieve continuous streaming refreshes to foo and redundancy in the event of failure, I have several instances of the process running at the same time. And when that happens I occasionally get concurrent serialization errors. This is because both processes are replaying some of the same changes to foo from foo_staging in overlapping transactions.

What happens is that the first process creates the staging_foo table, and the second process is blocked when it attempts to create a table with the same name (this is what I want). When the first process commits its transaction (which can take several seconds) I find that the second process gets unblocked before the commit is complete. So it appears to be getting a snapshot of the foo table before the commit is in place, which causes the inserts/updates/deletes (some of which may be redundant) to fail.

I am theorizing based on the documentation http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html where it says:

Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each concurrent transaction will create a snapshot of the database at the beginning of the transaction. A database snapshot is created within a transaction on the first occurrence of most SELECT statements, DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL commands :

ALTER TABLE (to add or drop columns)

CREATE TABLE

DROP TABLE

TRUNCATE TABLE

The documentation quoted above is somewhat confusing to me because it first says a snapshot will be created at the beginning of a transaction, but subsequently says a snapshot will be created only at the first occurrence of some specific DML/DDL operations.

I do not want to do a deep copy where I replace foo instead of incrementally updating it. I have other processes that continually query this table so there is never a time when I can replace it without interruption. Another question asks a similar question for deep copy but it will not work for me: How can I ensure synchronous DDL operations on a table that is being replaced?

Is there a way for me to perform my operations in a way that I can avoid concurrent serialization errors? I need to ensure that read access is available for foo so I can't LOCK that table.

回答1:

OK, Postgres (and therefore Redshift [more or less]) uses MVCC (Multi Version Concurrency Control) for transaction isolation instead of a db/table/row/page locking model (as seen in SQL Server, MySQL, etc.). Simplistically every transaction operates on the data as it existed when the transaction started.

So your comment "I have several instances of the process running at the same time" explains the problem. If Process 2 starts while Process 1 is running then Process 2 has no visibility of the results from Process 1.