A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE UPDATE
and the standard supports as part of the MERGE
operation.
Given that PostgreSQL doesn't support it directly (before pg 9.5), how do you do this? Consider the following:
CREATE TABLE testtable (
id integer PRIMARY KEY,
somedata text NOT NULL
);
INSERT INTO testtable (id, somedata) VALUES
(1, 'fred'),
(2, 'bob');
Now imagine that you want to "upsert" the tuples (2, 'Joe')
, (3, 'Alan')
, so the new table contents would be:
(1, 'fred'),
(2, 'Joe'), -- Changed value of existing tuple
(3, 'Alan') -- Added new tuple
That's what people are talking about when discussing an upsert
. Crucially, any approach must be safe in the presence of multiple transactions working on the same table - either by using explicit locking, or otherwise defending against the resulting race conditions.
This topic is discussed extensively at Insert, on duplicate update in PostgreSQL?, but that's about alternatives to the MySQL syntax, and it's grown a fair bit of unrelated detail over time. I'm working on definitive answers.
These techniques are also useful for "insert if not exists, otherwise do nothing", i.e. "insert ... on duplicate key ignore".
Tested on Postgresql 9.3
9.5 and newer:
PostgreSQL 9.5 and newer support
INSERT ... ON CONFLICT UPDATE
(andON CONFLICT DO NOTHING
), i.e. upsert.Comparison with
ON DUPLICATE KEY UPDATE
.Quick explanation.
For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.
Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
The commit adding the feature is here and the discussion around its development is here.
If you're on 9.5 and don't need to be backward-compatible you can stop reading now.
9.4 and older:
PostgreSQL doesn't have any built-in
UPSERT
(orMERGE
) facility, and doing it efficiently in the face of concurrent use is very difficult.This article discusses the problem in useful detail.
In general you must choose between two options:
Individual row retry loop
Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.
The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in
READ COMMITTED
mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.
Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully
INSERT
s; the other gets a duplicate key error and does anUPDATE
instead. TheUPDATE
blocks waiting for theINSERT
to rollback or commit. When it rolls back, theUPDATE
condition re-check matches zero rows, so even though theUPDATE
commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:
then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both
UPDATE
at the same time, matching zero rows and continuing. Then they both do theEXISTS
test, which happens before theINSERT
. Both get zero rows, so both do theINSERT
. One fails with a duplicate key error.This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.
Please don't roll your own solution for this. Like with message queuing, it's probably wrong.
Bulk upsert with lock
Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.
In this case, you typically follow the following process:
CREATE
aTEMPORARY
tableCOPY
or bulk-insert the new data into the temp tableLOCK
the target tableIN EXCLUSIVE MODE
. This permits other transactions toSELECT
, but not make any changes to the table.Do an
UPDATE ... FROM
of existing records using the values in the temp table;Do an
INSERT
of rows that don't already exist in the target table;COMMIT
, releasing the lock.For example, for the example given in the question, using multi-valued
INSERT
to populate the temp table:Related reading
MERGE
on the PostgreSQL wikiWhat about
MERGE
?SQL-standard
MERGE
actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use
MERGE
for upserts, but it's actually wrong.Other DBs:
INSERT ... ON DUPLICATE KEY UPDATE
in MySQLMERGE
from MS SQL Server (but see above aboutMERGE
problems)MERGE
from Oracle (but see above aboutMERGE
problems)SQLAlchemy upsert for Postgres >=9.5
Since the large post above covers many different SQL approaches for Postgres versions (not only non-9.5 as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5. Instead of implementing your own upsert, you can also use SQLAlchemy's functions (which were added in SQLAlchemy 1.1). Personally, I would recommend using these, if possible. Not only because of convenience, but also because it lets PostgreSQL handle any race conditions that might occur.
Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)
SQLAlchemy supports
ON CONFLICT
now with two methodson_conflict_do_update()
andon_conflict_do_nothing()
:Copying from the documentation:
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert
I am trying to contribute with another solution for the single insertion problem with the pre-9.5 versions of PostgreSQL. The idea is simply to try to perform first the insertion, and in case the record is already present, to update it:
Note that this solution can be applied only if there are no deletions of rows of the table.
I do not know about the efficiency of this solution, but it seems to me reasonable enough.
Here are some examples for
insert ... on conflict ...
(pg 9.5+) :Insert, on conflict - do nothing.
insert into dummy(id, name, size) values(1, 'new_name', 3) on conflict do nothing;
Insert, on conflict - do update, specify conflict target via column.
insert into dummy(id, name, size) values(1, 'new_name', 3) on conflict(id) do update set name = 'new_name', size = 3;
Insert, on conflict - do update, specify conflict target via constraint name.
insert into dummy(id, name, size) values(1, 'new_name', 3) on conflict on constraint dummy_pkey do update set name = 'new_name', size = 4;
Since this question was closed, I'm posting here for how you do it using SQLAlchemy. Via recursion, it retries a bulk insert or update to combat race conditions and validation errors.
First the imports
Now a couple helper functions
And finally the upsert function
Here's how you use it
The advantage this has over
bulk_save_objects
is that it can handle relationships, error checking, etc on insert (unlike bulk operations).