Summary:
I am interested in knowing the best practice for high-throughput applications in which bulk messages try to update the same row and hit Oracle deadlock errors. I know you cannot avoid those errors entirely, but how do you recover from them gracefully without getting bogged down by the same deadlocks happening over and over again?
Details:
We are building a high-throughput JMS messaging application. The production environment will be two WebLogic 11g nodes (each running 6 MDB listener instances). We were getting Oracle deadlock errors (ORA-00060) when around 1000 messages all try to update the same row in the Oracle database. Java synchronization across nodes is not possible with the standard Java threading API, and unless there is no other solution we do not want to use any 3rd-party products such as Terracotta.
We were hoping that Oracle's "SELECT ... FOR UPDATE WAIT n" statement would help, because it essentially makes the competing threads (for the same row) wait a few seconds until the thread that got the row lock first is done with it.
The first issue with "SELECT FOR UPDATE WAIT n" is that it does not accept wait times in milliseconds. This starts to hurt our application's throughput, because even a 1-second WAIT (the smallest allowed) delays the messages.
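For reference, the lock we take looks roughly like the sketch below; the table and column names are placeholders rather than our real schema, and WAIT only accepts whole seconds:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class RowLockExample {

    // Locks the slot row for the given position type, waiting at most one
    // second (the smallest granularity WAIT accepts) for competing sessions.
    static long lockSlot(Connection con, String positionType) throws SQLException {
        String sql = "SELECT QUANTITY FROM POSITION_SLOT "
                   + "WHERE POSITION_TYPE = ? FOR UPDATE WAIT 1";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, positionType);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("No slot row for " + positionType);
                }
                // If the lock is not granted within 1 second, Oracle raises
                // ORA-30006 (resource busy; acquire with WAIT timeout expired),
                // which reaches the caller as a SQLException to retry on.
                return rs.getLong(1);
            }
        }
    }
}
```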
Second, we are fiddling with the WebLogic queue redelivery delay parameter (30 seconds in our case). Whenever a message bounces back because of a deadlock error, it waits 30 seconds before being retried.
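The retry path is roughly the sketch below: the MDB recognizes ORA-00060 by its error code and marks the container-managed transaction for rollback, so the message goes back to the queue and is redelivered after the configured delay (the class and the update method are placeholders, not our exact code):

```java
import java.sql.SQLException;

import javax.annotation.Resource;
import javax.ejb.EJBException;
import javax.ejb.MessageDriven;
import javax.ejb.MessageDrivenContext;
import javax.jms.Message;
import javax.jms.MessageListener;

@MessageDriven
public class PositionUpdateMDB implements MessageListener {

    private static final int ORA_DEADLOCK = 60; // ORA-00060

    @Resource
    private MessageDrivenContext ctx;

    @Override
    public void onMessage(Message message) {
        try {
            applyUpdates(message); // placeholder for the JDBC update logic
        } catch (SQLException e) {
            if (e.getErrorCode() == ORA_DEADLOCK) {
                // Roll back the container-managed transaction so the message
                // returns to the queue and is redelivered after the delay.
                ctx.setRollbackOnly();
            } else {
                throw new EJBException(e);
            }
        }
    }

    private void applyUpdates(Message message) throws SQLException {
        // The JDBC work that updates the four position slots goes here.
    }
}
```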
In our experience, 1000 competing messages often take forever to process, because the deadlock keeps happening over and over.
I understand that with the current architecture we are bound to get deadlock errors (in the case of 1000 competing messages), but the application should be resilient enough to recover from them by retrying the looping messages.
Any idea what we are missing here? Has anybody dealt with similar issues before?
I am looking for design ideas that make this work resiliently, so that it recovers from the deadlock situation and eventually processes all messages in a reasonable amount of time without much additional hardware.
COMPUTATION DETAILS: Each of these 1000 messages creates 4 objects of 4 different position types, each with a quantity associated with it. Those quantities have to be merged into 4 different slots (depending on the position type). The deadlock happens when those 4 individual slots are updated by each individual thread. We have already ordered those individual updates in a specific order before applying them to the database rows, to avoid any possible race conditions, roughly as in the sketch below.
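The per-message work, with placeholder table, column and class names, looks something like this:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;
import java.util.TreeMap;

public class SlotUpdater {

    // Applies one message's four quantity deltas to the four slot rows.
    // The TreeMap keeps the position types in a fixed (sorted) order, so every
    // thread touches the rows in the same sequence.
    static void applyDeltas(Connection con, Map<String, Long> deltasByPositionType)
            throws SQLException {
        Map<String, Long> ordered = new TreeMap<>(deltasByPositionType);
        String sql = "UPDATE POSITION_SLOT SET QUANTITY = QUANTITY + ? WHERE POSITION_TYPE = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (Map.Entry<String, Long> e : ordered.entrySet()) {
                ps.setLong(1, e.getValue());
                ps.setString(2, e.getKey());
                ps.executeUpdate();
            }
        }
    }
}
```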
As for the MDB: let it consume the messages and update instance variables that hold the delta of the quantities of the processed messages (an MDB can carry state in its instance variables across multiple messages). A @Schedule method in the same MDB then persists the accumulated quantities every second (for example), in a single database transaction using a single SQL statement. I have done some tests with this approach.
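A minimal sketch of that pattern, assuming a POSITION_SLOT table keyed by POSITION_TYPE, a jdbc/AppDS datasource, and messages that carry a position type plus a quantity delta (all placeholder names):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

import javax.annotation.PreDestroy;
import javax.annotation.Resource;
import javax.ejb.EJBException;
import javax.ejb.MessageDriven;
import javax.ejb.Schedule;
import javax.jms.JMSException;
import javax.jms.MapMessage;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.sql.DataSource;

@MessageDriven
public class PositionDeltaMDB implements MessageListener {

    @Resource(mappedName = "jdbc/AppDS") // placeholder datasource name
    private DataSource dataSource;

    // Quantity deltas seen by this MDB instance since the last flush.
    private final Map<String, Long> pendingDeltas = new HashMap<>();

    @Override
    public void onMessage(Message message) {
        try {
            MapMessage m = (MapMessage) message;
            String positionType = m.getString("positionType"); // placeholder property
            long quantity = m.getLong("quantity");             // placeholder property
            synchronized (pendingDeltas) {
                pendingDeltas.merge(positionType, quantity, Long::sum);
            }
        } catch (JMSException e) {
            throw new EJBException(e);
        }
    }

    // Flush the accumulated deltas once per second.
    @Schedule(hour = "*", minute = "*", second = "*", persistent = false)
    public void flush() {
        persistPendingDeltas();
    }

    // Flush once more when the container retires this instance, so nothing is lost.
    @PreDestroy
    public void beforeDestroy() {
        persistPendingDeltas();
    }

    private void persistPendingDeltas() {
        Map<String, Long> snapshot;
        synchronized (pendingDeltas) {
            if (pendingDeltas.isEmpty()) {
                return;
            }
            snapshot = new HashMap<>(pendingDeltas);
            pendingDeltas.clear();
        }
        // One UPDATE per slot, batched in a single transaction (the literal
        // single-statement version could be a MERGE instead).
        String sql = "UPDATE POSITION_SLOT SET QUANTITY = QUANTITY + ? WHERE POSITION_TYPE = ?";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            for (Map.Entry<String, Long> e : snapshot.entrySet()) {
                ps.setLong(1, e.getValue());
                ps.setString(2, e.getKey());
                ps.addBatch();
            }
            ps.executeBatch();
        } catch (SQLException e) {
            throw new EJBException(e);
        }
    }
}
```

Each MDB instance only flushes the deltas it has accumulated itself; since the updates are additive increments, the slot totals still converge.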
Notes:
- Use a @PreDestroy method to persist the pending deltas, to make sure that nothing gets lost.
- Alternatively, use a plain consumer and call receive(timeout) to collect 100 - 10000 messages (or until a timeout), do one DB transaction, and right after that commit the queue session. This is better, but it is still not XA transactional; if you need that, both commits have to be coordinated by a single XA transaction.
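A rough sketch of that receive(timeout) variant, using a locally transacted (non-XA) queue session; the connection factory, destination and DB handler below are placeholders:

```java
import java.util.ArrayList;
import java.util.List;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;

public class BatchingConsumer {

    // Collects up to maxBatch messages (or until the timeout expires), applies
    // them to the database in one transaction, then commits the queue session.
    // The DB commit and the session commit are still two separate commits, so
    // a crash between them can replay a batch; that is the non-XA gap above.
    static void drainOnce(ConnectionFactory cf, Queue queue, int maxBatch,
                          long timeoutMillis) throws Exception {
        Connection connection = cf.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            MessageConsumer consumer = session.createConsumer(queue);

            List<Message> batch = new ArrayList<>();
            Message m;
            while (batch.size() < maxBatch && (m = consumer.receive(timeoutMillis)) != null) {
                batch.add(m);
            }

            if (!batch.isEmpty()) {
                persistBatchInOneDbTransaction(batch); // placeholder for the JDBC work
                session.commit();                      // acknowledges all messages at once
            }
        } finally {
            connection.close();
        }
    }

    private static void persistBatchInOneDbTransaction(List<Message> batch) {
        // Merge the quantity deltas from the batch and apply them in one
        // database transaction (e.g. one UPDATE per slot).
    }
}
```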
A deadlock implies that each thread is trying to update multiple rows in a single transaction and that those updates are being done in a different order across threads. The simplest possible answer, therefore, would be to modify the code so that the updates within the same transaction are applied in some defined order (i.e. in order of the primary key). That would ensure that you never get a deadlock, though you would still get blocking locks while one thread waits for another thread to commit its transaction.
Taking a step back, though, it seems unlikely that you would really want many threads updating the same row in a table when you cannot predict the order of the updates. It seems highly likely that this would lead to lots of lost updates and some rather unpredictable behavior. What, exactly, is your application doing that makes this sort of thing sensible? Are you doing something like updating aggregate tables after inserting rows into a detail table (e.g. updating the count of the number of views a post has, in addition to logging information about a particular view)? If so, do those operations really need to be synchronous? Or could you update the view count periodically in another thread by aggregating the views over the past N seconds?