I've got this table (generated by Django):
CREATE TABLE feeds_person (
id serial PRIMARY KEY,
created timestamp with time zone NOT NULL,
modified timestamp with time zone NOT NULL,
name character varying(4000) NOT NULL,
url character varying(1000) NOT NULL,
email character varying(254) NOT NULL,
CONSTRAINT feeds_person_name_ad8c7469_uniq UNIQUE (name, url, email)
);
I'm trying to bulk insert a lot of data using INSERT with an ON CONFLICT clause. The wrinkle is that I need to get the id back for all of the rows, whether they were just inserted or already existed.
In other cases, I would do something like:
INSERT INTO feeds_person (created, modified, name, url, email)
VALUES blah blah blah
ON CONFLICT (name, url, email) DO UPDATE SET url = feeds_person.url
RETURNING id
Doing the UPDATE causes the statement to return the id of that row. Except it doesn't work with this table. I think it fails because I have multiple columns that are unique together, whereas in the other cases where I've used this method there was just a single unique column.
I get this error when trying to run the SQL through Django's cursor:
django.db.utils.ProgrammingError: ON CONFLICT DO UPDATE command cannot affect row a second time
HINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
How do I do the bulk insert with this table and get back the inserted and existing ids?
The error you get:

ON CONFLICT DO UPDATE command cannot affect row a second time

indicates you are trying to upsert the same row more than once in a single command. In other words: you have dupes on (name, url, email) in your VALUES list. Fold duplicates (if that's an option) and it should work. But you will have to decide which row to pick from each set of dupes.

When using a free-standing VALUES expression (rather than one attached directly to the INSERT), you have to add explicit type casts for non-default types: your timestamptz columns need an explicit cast, while the string types can operate with the default text. (You could still cast to varchar(n) right away.) There are several ways to determine which row to pick from each set of dupes, e.g. DISTINCT ON with a deterministic ORDER BY.
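A minimal sketch of folding duplicates that way for this table (the VALUES rows are made-up placeholders; DISTINCT ON keeps one deterministic row per (name, url, email), here the one with the latest modified):

```sql
INSERT INTO feeds_person (created, modified, name, url, email)
SELECT DISTINCT ON (name, url, email) *
FROM  (
   VALUES
      (timestamptz '2016-03-12', timestamptz '2016-03-12', text 'Alice', 'http://a.example.com', 'a@example.com')
    , (timestamptz '2016-03-12', timestamptz '2016-03-13', 'Alice', 'http://a.example.com', 'a@example.com')  -- dupe on the key, folded away
   ) v (created, modified, name, url, email)
ORDER  BY name, url, email, modified DESC   -- pick the latest row per key
ON     CONFLICT (name, url, email) DO UPDATE
SET    url = feeds_person.url               -- empty update, only to make RETURNING work
RETURNING id;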
You are right, there is (currently) no way to get excluded rows in the RETURNING clause; the Postgres Wiki confirms as much. However, you shouldn't be updating rows that are not supposed to be updated. Empty updates are almost as expensive as regular updates - and might have unintended side effects. You don't strictly need UPSERT to begin with; your case looks more like "SELECT or INSERT".
One cleaner way to insert a set of rows would be with data-modifying CTEs.
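A sketch of that CTE approach for this table (the VALUES rows are placeholders; 'i' marks freshly inserted rows, 's' rows that already existed):

```sql
WITH input_rows (created, modified, name, url, email) AS (
   VALUES
      (timestamptz '2016-03-12', timestamptz '2016-03-12', text 'Alice', 'http://a.example.com', 'a@example.com')
    , (timestamptz '2016-03-12', timestamptz '2016-03-12', 'Bob',   'http://b.example.com', 'b@example.com')
   )
, ins AS (
   INSERT INTO feeds_person (created, modified, name, url, email)
   SELECT * FROM input_rows
   ON     CONFLICT (name, url, email) DO NOTHING
   RETURNING id                      -- ids of newly inserted rows only
   )
SELECT 'i' AS src, id
FROM   ins
UNION  ALL
SELECT 's', f.id                     -- ids of rows that already existed
FROM   input_rows i
JOIN   feeds_person f USING (name, url, email);
```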
The added complexity should pay off for big tables where INSERT is the rule and SELECT the exception.

Originally, I had added a NOT EXISTS predicate on the last SELECT to prevent duplicates in the result. But that was redundant. All CTEs of a single query see the same snapshots of tables. The set returned with ON CONFLICT (name, url, email) DO NOTHING is mutually exclusive to the set returned after the INNER JOIN on the same columns.

Unfortunately, this also opens a tiny window for a race condition: if a concurrent transaction commits conflicting rows in between, those rows are neither inserted (DO NOTHING) nor visible to the final SELECT (which still sees the earlier snapshot), so some rows may be lost from the result.
You might just INSERT .. ON CONFLICT DO NOTHING, followed by a separate SELECT query for all rows - within the same transaction - to overcome this. Which in turn opens another tiny window for a race condition if concurrent transactions can commit writes to the table between INSERT and SELECT (in the default READ COMMITTED isolation level). That can be avoided with REPEATABLE READ transaction isolation (or stricter), or with a (possibly expensive or even unacceptable) write lock on the whole table. You can get any behavior you need, but there may be a price to pay.
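A sketch of that two-statement variant (table and columns from the question; the VALUES row is a placeholder, and the isolation level follows the reasoning above):

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;

INSERT INTO feeds_person (created, modified, name, url, email)
SELECT *
FROM  (
   VALUES (timestamptz '2016-03-12', timestamptz '2016-03-12', text 'Alice', 'http://a.example.com', 'a@example.com')
   ) v (created, modified, name, url, email)
ON CONFLICT (name, url, email) DO NOTHING;

SELECT id                            -- inserted AND pre-existing rows
FROM   feeds_person
WHERE (name, url, email) IN (('Alice', 'http://a.example.com', 'a@example.com'));

COMMIT;
```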