Perhaps it's my naiveté, perhaps my paranoia, but I think I'm looking for a solution to a race-condition issue that seems like it should be so common there'd be floods of solutions and I'd've found one by now... but I haven't.
The simplistic scenario: I have a process that's supposed to grab any records of which there's more than one of a certain type. I'd like to make the system/process(es) thread-/multiprocessing-/reentrant-/buzzword-of-the-day-safe. If the same process gets started again and introduces a race condition trying to grab rows of interest, I'd like there to be clear-cut winners and losers: success for one, error for the other. Actually, I'd prefer a seamless, silent, graceful "failure" for the second, in that it would merely NOT SEE those rows that were grabbed by the first instance.
Thus my dilemma.
The query I have is like this:
UPDATE my_table
SET processing_by = our_id_info -- unique to this worker
WHERE trans_nbr IN (
    SELECT trans_nbr
    FROM my_table
    GROUP BY trans_nbr
    HAVING COUNT(trans_nbr) > 1
    LIMIT our_limit_to_have_single_process_grab
)
RETURNING row_id
My thought is: there's no lock, so there's no guarantee of "state" between the subquery and the outer UPDATE. So, how to ensure that any candidates THIS process gets, we grab, and that they haven't been grabbed by another process in the meantime?
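To make the hazard concrete, here's a sketch of how the race can play out, assuming Postgres's default READ COMMITTED isolation and hypothetical worker IDs 'A' and 'B' (not names from my actual setup):

-- Session A:
BEGIN;
UPDATE my_table
SET processing_by = 'A' -- hypothetical worker ID
WHERE trans_nbr IN (
    SELECT trans_nbr
    FROM my_table
    GROUP BY trans_nbr
    HAVING COUNT(trans_nbr) > 1
    LIMIT our_limit_to_have_single_process_grab
)
RETURNING row_id;
-- A now holds row locks on everything it matched. Session B runs the
-- same statement with 'B' and blocks on those locks. When A COMMITs,
-- B wakes up and (per READ COMMITTED rules) re-checks only its WHERE
-- conditions against the new row versions; trans_nbr didn't change, so
-- the rows still qualify and B overwrites A's processing_by. Both
-- workers now think they "own" the same rows.
COMMIT;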
I've thought about adding a FOR UPDATE OF my_table at the end of the subquery, but that won't work: you can't have it alongside a GROUP BY (which is necessary for figuring out the COUNT of trans_nbrs). (A shame, as this would force any also-rans to block pending our update; it would have been the preferred solution, both avoiding a race-condition-caused error [two processes grabbing the same row{s}] and allowing those other processes to be blissfully unaware, merely getting rows that no longer include the ones the first process grabbed. Alas.)
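For the record, here's roughly what that dead end looks like (the exact error wording varies a bit across Postgres versions):

SELECT trans_nbr
FROM my_table
GROUP BY trans_nbr
HAVING COUNT(trans_nbr) > 1
LIMIT our_limit_to_have_single_process_grab
FOR UPDATE OF my_table;
-- ERROR:  SELECT FOR UPDATE/SHARE is not allowed with GROUP BY clause
-- (Grouped output rows aren't table rows, so there's nothing to row-lock.)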
I've thought about locking the table, but (in Postgres, at least) table locks are only released after a COMMIT; for testing purposes, I don't want to COMMIT, therefore during testing (yes, the pre-live testing on the live database AFTER testing on a test db) it wouldn't do to go this route. (Plus, even live, this would give an unacceptable performance hit given enough users/processes.)
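For completeness, the rejected table-lock route would look something like this; SHARE ROW EXCLUSIVE is one mode that conflicts with itself, so concurrent workers would fully serialize:

BEGIN;
LOCK TABLE my_table IN SHARE ROW EXCLUSIVE MODE; -- a second worker's LOCK waits here...
UPDATE my_table
SET processing_by = our_id_info
WHERE trans_nbr IN (
    SELECT trans_nbr
    FROM my_table
    GROUP BY trans_nbr
    HAVING COUNT(trans_nbr) > 1
    LIMIT our_limit_to_have_single_process_grab
);
COMMIT; -- ...until here; there's no way to drop the table lock any sooner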
I've thought about making the update dependent upon what the processing_by value was for our subquery but, again, that won't work: putting it in the subquery would break the GROUP BY/HAVING condition (as there would now be sub-groups of trans_nbr/processing_by being counted, which isn't what I'm after).
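To see why, consider what grouping on processing_by as well would do (illustrative values only):

-- Suppose trans_nbr 42 has two rows: one ungrabbed (processing_by IS NULL),
-- one already claimed by another worker.
SELECT trans_nbr
FROM my_table
GROUP BY trans_nbr, processing_by
HAVING COUNT(*) > 1;
-- The (42, NULL) and (42, 'other_worker') sub-groups each have COUNT(*) = 1,
-- so trans_nbr 42 drops out of the result even though it IS a duplicate.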
I'm expecting some shrill pointer in the Right Direction, mocking me for asking such an obvious question, but it's not been obvious to me (obviously ;o) and I assure you, I have been researching this for literally hours.
Thanks, much, for any hints, let alone solutions!
UPDATE: Thanks SO MUCH Chris Travers!
That ol' line about the "Forest for the Trees" comes to mind! :>
Here's a modified version of the query, taking this suggestion into account, and adding another "double check". This should be THE ONE.
UPDATE my_table
SET processing_by = our_id_info -- unique to this worker
WHERE trans_nbr IN (
    SELECT trans_nbr
    FROM my_table
    WHERE trans_nbr IN (
        SELECT trans_nbr
        FROM my_table
        GROUP BY trans_nbr
        HAVING COUNT(*) > 1 -- Thanks for the suggestion, Flimzy
        LIMIT our_limit_to_have_single_process_grab
    )
    AND processing_by IS NULL
    /* Or some other logic that says "not currently being
       processed". This way, we ALSO verify we're not
       grabbing one that might have been UPDATEd/grabbed
       during our sub-SELECT, while it was being
       blocked/waiting.
       This COULD go in our UPDATE/top level, but then
       unnecessary rows could be locked by this lower level.
    */
    FOR UPDATE /* Will block/wait for rows this finds to be unlocked by
                  any prior transaction that had a lock on them.
                  NOTE: Which _could_ allow the prior trans to change
                  our desired rows in the meantime, thus the
                  secondary WHERE clause.
               */
)
RETURNING row_id
I would love for Postgres to have a SKIP LOCKED-like feature. Especially for queues of essentially atomic rows that need to be processed without blocking other processing. But alas. Maybe someday...? Or "soon"? :-)
For now, one may add NOWAIT to avoid getting blocked by any other transaction(s); however, keep in mind it merely dumps back with an error, so you'll have to keep trying your query until it succeeds (or give up). Without NOWAIT, the query blocks until other transactions release their locks, or the query times out.
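A sketch of the NOWAIT flavor; the retry loop lives in the application:

SELECT row_id
FROM my_table
WHERE processing_by IS NULL
FOR UPDATE NOWAIT;
-- If another transaction already holds a lock on any matched row:
-- ERROR:  could not obtain lock on row in relation "my_table"
-- Catch that in the application and retry (or give up).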
UPDATE 2: SO, after re-re-re-reading this and thinking about it, again a "Forest for the Trees" moment. I can simply do like this:
UPDATE my_table
SET processing_by = our_id_info -- unique to this worker
WHERE trans_nbr IN (
    -- This query MAY pull ones we don't want to mess with (already "grabbed")
    SELECT trans_nbr
    FROM my_table
    WHERE processing_by IS NULL -- only "ungrabbed" ones (at this point)
    GROUP BY trans_nbr
    HAVING COUNT(*) > 1
    LIMIT our_limit_to_have_single_process_grab
)
AND processing_by IS NULL -- But THIS will drop out any "bogus" ones that changed between subquery and here
RETURNING row_id
COMMIT the transaction to release OUR locks, and Bob's yer uncle.
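Putting it together, a worker's full cycle would look roughly like this; the final "release" UPDATE is my own assumption about the surrounding app (you might DELETE or set a status column instead):

BEGIN;
UPDATE my_table
SET processing_by = our_id_info
WHERE trans_nbr IN (
    SELECT trans_nbr
    FROM my_table
    WHERE processing_by IS NULL
    GROUP BY trans_nbr
    HAVING COUNT(*) > 1
    LIMIT our_limit_to_have_single_process_grab
)
AND processing_by IS NULL
RETURNING row_id;
COMMIT; -- releases OUR row locks; any worker blocked on them re-checks
        -- processing_by IS NULL, sees our committed value, and skips these rows
-- ...process the returned row_ids, then hand the rows back (or mark them done):
UPDATE my_table
SET processing_by = NULL -- hypothetical release step; adapt to your "done" convention
WHERE processing_by = our_id_info;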
SKIP LOCKED would still be super-cool, though.
A CAVEAT: If one were to have workers pulling a limited number of rows (like LIMIT 1) and/or items that must be grabbed in a certain order (e.g. FIFO, via ORDER BY and/or a function like MIN(id)), there can be cases of starved workers: a worker waits and waits, and when the row(s) it was waiting on unblock, it turns out none of them meet its final criteria. There are a number of ways to try to get around this, like having workers jump around via OFFSET, but most are either complex or slow. (Usually both. BONUS!)
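For illustration (this pattern is hypothetical, not mine), a LIMIT-1-style FIFO grab that's prone to exactly that starvation:

UPDATE my_table
SET processing_by = our_id_info
WHERE row_id = (
    SELECT MIN(row_id)
    FROM my_table
    WHERE processing_by IS NULL
)
AND processing_by IS NULL
RETURNING row_id;
-- Every idle worker converges on the same lowest row_id and queues up
-- behind whoever row-locked it first; when that transaction COMMITs, the
-- losers re-check processing_by IS NULL, fail it, and come away empty.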
MY functionality expects multiple rows returned, or none (which is A-OK: nothing to do for now; sleep for a bit and recheck), so this isn't a problem for me. It may be for you. If so, you'll want to consider a...
NON-BLOCKING VERSION: I found a great article that, it turns out, works through this very problem, and it introduced me to Pg's Advisory Locks. (This one was quite informative, too.)
So, a non-blocking solution to my own problem should look like this:
UPDATE my_table
SET processing_by = our_id_info -- unique to this worker
WHERE trans_nbr IN (
    -- This query MAY pull ones we don't want to mess with (already "grabbed")
    SELECT trans_nbr
    FROM my_table AS inner_my_table_1
    WHERE processing_by IS NULL -- only "ungrabbed" ones (at this point)
    GROUP BY trans_nbr
    HAVING COUNT(*) > 1
       AND COUNT(*) IN ( -- For MY query, since I'm grouping, I want "all or none" of a trans_nbr's rows
           SELECT COUNT(*)
           FROM my_table AS inner_my_table_2
           WHERE inner_my_table_2.trans_nbr = inner_my_table_1.trans_nbr
             AND pg_try_advisory_xact_lock(id) -- INT that will uniquely ID this row
       )
    /* Note also that this will still lock all non-locked rows with this
       trans_nbr, even though we won't use them unless we can grab ALL of the
       rows with the same trans_nbr... the rest of our query should be made
       quick enough to accept this reality and not tie up the server unduly.
       See the linked info for simpler queries not doing GROUP BYs.
    */
    LIMIT our_limit_to_have_single_process_grab
)
AND processing_by IS NULL -- But THIS will drop out any "bogus" ones that changed between subquery and here
RETURNING row_id
NOTES:
- It's up to the applications to use/respect Advisory Locks, so this is no panacea, but nor is it a placebo. Again, SKIP LOCKED would be very handy because of this.
- pg_try_advisory_lock, available since v8.2, does NOT auto-unlock at the end of the transaction, and (thus) may (MUST) be explicitly unlocked, else it's held until the session ends.
- pg_try_advisory_xact_lock, available since v9.1, auto-unlocks at the end of the transaction, and may NOT be explicitly unlocked. (See the sketch after these notes.)
- I HAVE NOT TESTED THIS YET! I'll edit/update when I have...
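To illustrate those last two notes side by side (42 standing in for whatever INT uniquely identifies a row):

BEGIN;
SELECT pg_try_advisory_lock(42);       -- session-level: returns t/f immediately...
SELECT pg_advisory_unlock(42);         -- ...and must be released by hand (or held to session end)
SELECT pg_try_advisory_xact_lock(42);  -- transaction-level: there is no unlock function
COMMIT;                                -- ...the xact-level lock evaporates right here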