I wrote a function to create posts for a simple blogging engine:
```sql
CREATE FUNCTION CreatePost(VARCHAR, TEXT, VARCHAR[])
RETURNS INTEGER AS $$
DECLARE
    InsertedPostId INTEGER;
    TagName        VARCHAR;
BEGIN
    INSERT INTO Posts (Title, Body)
    VALUES ($1, $2)
    RETURNING Id INTO InsertedPostId;

    FOREACH TagName IN ARRAY $3 LOOP
        DECLARE
            InsertedTagId INTEGER;
        BEGIN
            -- I am concerned about this part.
            BEGIN
                INSERT INTO Tags (Name)
                VALUES (TagName)
                RETURNING Id INTO InsertedTagId;
            EXCEPTION WHEN UNIQUE_VIOLATION THEN
                SELECT INTO InsertedTagId Id
                FROM Tags
                WHERE Name = TagName
                FETCH FIRST ROW ONLY;
            END;

            INSERT INTO Taggings (PostId, TagId)
            VALUES (InsertedPostId, InsertedTagId);
        END;
    END LOOP;

    RETURN InsertedPostId;
END;
$$ LANGUAGE 'plpgsql';
```
Is this prone to race conditions when multiple users delete tags and create posts at the same time?
Specifically, do transactions (and thus functions) prevent such race conditions from happening?
I'm using PostgreSQL 9.2.3.
I think there is a slight chance that, when the tag already existed, it might be deleted by another transaction after your transaction has found it. Using `SELECT ... FOR UPDATE` should solve that.
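A minimal sketch of that fix inside the OP's function (hypothetical, using the OP's table and variable names):

```sql
-- Lock the existing tag row so a concurrent DELETE must wait until
-- this transaction commits (FOR SHARE would also be enough to block deletes):
SELECT Id
INTO   InsertedTagId
FROM   Tags
WHERE  Name = TagName
FOR    UPDATE;
```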
There's still something to watch out for, even when using the `ON CONFLICT` clause introduced in Postgres 9.5. Using the same function and example table as in @Erwin Brandstetter's answer, if we run it concurrently in two sessions, `f_tag_id` can return `NULL` in session 2, which would be impossible in a single-threaded world!

If we raise the transaction isolation level to `repeatable read` (or the stronger `serializable`), session 2 throws `ERROR: could not serialize access due to concurrent update` instead. So no "impossible" results at least, but unfortunately we now need to be prepared to retry the transaction.

Edit: With `repeatable read` or `serializable`, if session 1 inserts tag `a`, then session 2 inserts `b`, then session 1 tries to insert `b` and session 2 tries to insert `a`, one session detects a deadlock. After the session that received the deadlock error rolls back, the other session continues. So I guess we should treat a deadlock just like `serialization_failure` and retry, in a situation like this? Alternatively, insert the tags in a consistent order, but this is not easy if they don't all get added in one place.
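The example transcript was lost here; what follows is a hedged sketch of the race, assuming the `tag` table and function shape from @Erwin Brandstetter's answer:

```sql
-- Session 1 (READ COMMITTED), inside an open transaction:
INSERT INTO tag (tag) VALUES ('new_tag');         -- not yet committed

-- Session 2 runs the function body concurrently:
SELECT tag_id FROM tag WHERE tag = 'new_tag';     -- finds nothing (row uncommitted)
INSERT INTO tag (tag) VALUES ('new_tag')
ON CONFLICT (tag) DO NOTHING                      -- waits for session 1; after its
RETURNING tag_id;                                 -- COMMIT the conflict is skipped,
                                                  -- so RETURNING yields no row
-- => the function returns NULL for a tag that now exists
```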
It's the recurring problem of `SELECT` or `INSERT` under possible concurrent write load, related to (but different from) UPSERT (which is `INSERT` or `UPDATE`).

For Postgres 9.5 or later

Using the new UPSERT implementation `INSERT ... ON CONFLICT ... DO UPDATE`, we can largely simplify. PL/pgSQL function to `INSERT` or `SELECT` a single row (tag):

There is still a tiny window for a race condition. To make absolutely sure you get an ID:
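The function body did not survive extraction; here is a hedged reconstruction of the loop variant, assuming a table like `tag (tag_id serial PRIMARY KEY, tag text UNIQUE NOT NULL)`. Dropping the `LOOP` / `EXIT` scaffolding gives the simpler single-shot variant mentioned above:

```sql
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
   LOOP
      SELECT tag_id
      FROM   tag
      WHERE  tag = _tag
      INTO   _tag_id;

      EXIT WHEN FOUND;                  -- tag already existed

      INSERT INTO tag (tag)
      VALUES (_tag)
      ON     CONFLICT (tag) DO NOTHING  -- a concurrent insert may win the race ...
      RETURNING tag_id
      INTO   _tag_id;

      EXIT WHEN FOUND;                  -- ... then loop and re-SELECT
   END LOOP;
END
$func$;
```

Call it like `SELECT f_tag_id('php');` (example value).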
This keeps looping until either the `INSERT` or the `SELECT` succeeds. Call:

If subsequent commands in the same transaction rely on the existence of the row, and it is actually possible that other transactions update or delete it concurrently, you can lock an existing row in the `SELECT` statement with `FOR SHARE`. If the row gets inserted instead, it is locked until the end of the transaction anyway.

If a new row is inserted most of the time, start with the `INSERT` to make it faster.

Related:
Related (pure SQL) solution to `INSERT` or `SELECT` multiple rows (a set) at once:

What's wrong with this pure SQL solution?
I had previously also suggested this SQL function:
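(The function itself was lost in extraction; it was roughly of this shape, a hedged reconstruction with assumed names:)

```sql
-- NOT safe under concurrency, see below; both CTE legs see the same snapshot:
CREATE OR REPLACE FUNCTION f_tag_id(_tag text)
  RETURNS int
  LANGUAGE sql AS
$func$
WITH ins AS (
   INSERT INTO tag (tag)
   VALUES (_tag)
   ON     CONFLICT (tag) DO NOTHING
   RETURNING tag_id
   )
SELECT tag_id FROM ins       -- empty if the tag already existed ...
UNION  ALL
SELECT tag_id FROM tag       -- ... then pick the existing row
WHERE  tag = _tag
LIMIT  1
$func$;
```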
It isn't entirely wrong, but it fails to seal a loophole, as @FunctorSalad worked out in his added answer. The function can come up with an empty result if a concurrent transaction tries to do the same at the same time. All statements in a query with CTEs are virtually executed at the same time. The manual:

If a concurrent transaction inserts the same new tag a moment earlier, but hasn't committed yet:

The UPSERT part comes up empty, after waiting for the concurrent transaction to finish. (If the concurrent transaction should roll back, it still inserts the new tag and returns a new ID.)

The SELECT part also comes up empty, because it's based on the same snapshot, where the new tag from the (yet uncommitted) concurrent transaction is not visible.

We get nothing. Not as intended. That's counter-intuitive to naive logic (and I got caught there), but that's how the MVCC model of Postgres works, and has to work.

So do not use this if multiple transactions can try to insert the same tag at the same time. Or loop until you actually get a row. The loop will hardly ever be triggered in common workloads.
Original answer (Postgres 9.4 or older)

Given this (slightly simplified) table:

... a practically 100% secure function to insert a new tag / select an existing one could look like this.

Why not 100%? Consider the notes in the manual for the related UPSERT example:

SQL Fiddle.
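Both the table definition and the function were lost in extraction; the following is a hedged reconstruction consistent with the explanation below (names assumed):

```sql
CREATE TABLE tag (
   tag_id serial PRIMARY KEY,
   tag    text UNIQUE NOT NULL
);

CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
   LOOP
      BEGIN
         WITH sel AS (
            SELECT t.tag_id
            FROM   tag t
            WHERE  t.tag = _tag
            FOR    SHARE          -- blocks concurrent DELETE / UPDATE
            )
         ,  ins AS (
            INSERT INTO tag (tag)
            SELECT _tag
            WHERE  NOT EXISTS (SELECT 1 FROM sel)  -- insert only if not found
            RETURNING tag.tag_id
            )
         SELECT sel.tag_id FROM sel
         UNION  ALL
         SELECT ins.tag_id FROM ins
         INTO   _tag_id;
      EXCEPTION WHEN unique_violation THEN
         -- a concurrent transaction inserted the same tag; hardly ever happens
      END;

      EXIT WHEN _tag_id IS NOT NULL;  -- else loop and re-SELECT
   END LOOP;
END
$func$;
```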
Explanation

- Try the `SELECT` first. This way you avoid the considerably more expensive exception handling 99.99% of the time.
- Use a CTE to minimize the (already tiny) time slot for the race condition. The time window between the `SELECT` and the `INSERT` within one query is super tiny. If you don't have heavy concurrent load, or if you can live with an exception once a year, you could just ignore the case and use the plain SQL statement, which is faster.
- No need for `FETCH FIRST ROW ONLY` (= `LIMIT 1`). The tag name is obviously `UNIQUE`.
- Remove `FOR SHARE` in my example if you don't usually have concurrent `DELETE` or `UPDATE` on the table `tag`. It costs a tiny bit of performance.
- Never quote the language name: `'plpgsql'`. `plpgsql` is an identifier. Quoting may cause problems and is only tolerated for backwards compatibility.
- Don't use non-descriptive column names like `id` or `name`. When joining a couple of tables (which is what you do in a relational DB) you end up with multiple identical names and have to use aliases.

Built into your function
Using this function, you could largely simplify your `FOREACH` loop to:

Faster, though, as a single SQL statement with `unnest()`:

It replaces the whole loop.
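The elided statement probably resembled this sketch (hypothetical, using the OP's table names and the `f_tag_id()` function from above):

```sql
-- Inside CreatePost(): attach all tags of the array in one statement
INSERT INTO Taggings (PostId, TagId)
SELECT InsertedPostId, f_tag_id(tag)
FROM   unnest($3) AS tag;
```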
Alternative solution

This variant builds on the behavior of `UNION ALL` with a `LIMIT` clause: as soon as enough rows are found, the rest is never executed. Building on this, we can outsource the `INSERT` into a separate function. Only there do we need exception handling. Just as safe as the first solution.

Which is used in the main function:

This is a bit cheaper if most of the calls only need the `SELECT`, because the more expensive block with the `INSERT` containing the `EXCEPTION` clause is rarely entered. The query is also simpler. `FOR SHARE` is not possible here (not allowed in a `UNION` query). `LIMIT 1` would not be necessary (tested in pg 9.4): Postgres derives `LIMIT 1` from `INTO _tag_id` and only executes until the first row is found.
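The two elided functions were presumably of this shape (a hedged reconstruction; names assumed):

```sql
CREATE OR REPLACE FUNCTION f_insert_tag(_tag text, OUT _tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
   INSERT INTO tag (tag)
   VALUES (_tag)
   RETURNING tag_id
   INTO   _tag_id;
EXCEPTION WHEN unique_violation THEN
   -- lost the race: return NULL; the caller's SELECT leg finds the row
END
$func$;

CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
   SELECT tag_id
   FROM   tag
   WHERE  tag = _tag
   UNION  ALL
   SELECT f_insert_tag(_tag)  -- only executed if the tag was not found
   LIMIT  1
   INTO   _tag_id;
END
$func$;
```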