On a server application we have the following:
A class called a JobManager that is a singleton.
Another class, the Scheduler, that keeps checking if it is time to add any sort of job to the JobManager.
When it is time to do so, the Scheduler do something like:
TJobManager.Singleton.NewJobItem(parameterlist goes here...);
At the same time, on the client application, the user do something that generates a call to the server. Internally, the server sends a message to itself, and one of the classes listening for that message is the JobManager.
The JobManager handles the message, and knows that it is time to add a new job to the list, calling its own method:
NewJobItem(parameter list...);
On the NewJobItem method, I have something like this:
CS.Acquire;
try
DoSomething;
CallAMethodWithAnotherCriticalSessionInternally;
finally
CS.Release;
end;
It happens that the system reaches a deadlock at this point (CS.Acquire).
The communication between client and server application, is made via Indy 10.
I think, the RPC call that fire the server application method that sends a message to the JobManager is running on the context of the Indy Thread.
The Scheduler has its own thread running, and it makes a direct call to the JobManager method. Is this situation prone to deadlocks?
Can someone help me understand why a deadlock is happening here?
We knew that, sometimes, when the client did a specific action, that cause the system to lock, then I could finally find out this point, where the critical section on the same class is reached twice, from different points (the Scheduler and the message handler method of the JobManager).
Some more info
I want to add that (this may be silly, but anyway...) inside the DoSomething there is another
CS.Acquire;
try
Do other stuff...
finally
CS.Release;
end;
This internal CS.Release is doing anything to the external CS.Acquire? If so, this could be the point where the Scheduler is entering the Critical Section, and all the lock and unlock becomes a mess.
There isn't enough information about your system to be able to tell you definitively if your JobManager and Scheduler are causing a deadlock, but if they are both calling the same NewJobItem method, then this should not be the problem since they will both acquire the locks in the same order.
For your question if your NewJobItem CS.acquire and DoSomething CS.acquire interact with each other: it depends. If the lock object used in both methods is different, then no the two calls should be independant. If it's the same object then it depends on the type of lock. If you locks are re-entrant locks (eg. they allow acquire to be called multiple times from the same thread and count how many time they have been acquired and released) then this should not be a problem. On the other hand if you have simple lock objects that don't support re-entry, then the DoSomething CS.release could release your lock for that thread and then the CallAMethodWithAnotherCriticalSessionInternally would be running without the protection of the CS lock that was acquired in NewJobItem.
Deadlocks occur when there are two or more threads running and each thread is waiting for another thread to finish it's current job before it can continue its self.
For Example:
Thread 1 executes:
lock_a.acquire()
lock_b.acquire()
lock_b.release()
lock_a.release()
Thread 2 executes:
lock_b.acquire()
lock_a.acquire()
lock_a.release()
lock_b.release()
Notice that the locks in thread 2 are acquired in the opposite order from thread 1. Now if thread 1 acquires the lock_a and then is interrupted and thread 2 now runs and acquires lock_b and then starts waiting for lock_a to be available before it can continue. Then thread 1 continues running and the next thing it does is try to acquire lock_b, but it is already taken by thread 2 and so it waits. Finally we are in a situation in which thread 1 is waiting for thread 2 to release lock_b and thread 2 is waiting for thread 1 to release lock_a.
This is a deadlock.
There are several common solutions:
- Only use one shared global lock in all your code. This way it is impossible to have two threads waiting for two locks. This makes your code wait a lot for the lock to be available.
- Only ever allow your code to hold one lock at a time. This is usually too hard to control since you might not know or control the behavior of method calls.
- Only allow your code to acquire multiple locks all at the same time, and release them all at the same time, and disallow acquiring new locks while you already have locks acquired.
- Make sure that all locks are acquired in the same global order. This is a more common technique.
With solution 4. you need to be careful programming and always make sure that you acquire the locks/critical sections in the same order. To help with debugging you can place a global order on all the locks in your system (eg. just a unique integer for each lock) and then throwing an error if you try to acquire a lock that has a lower ranking that a lock that the current thread already has acquired (eg. if new_lock.id < lock_already_acquired.id then throw exception)
If you can't put in a global debugging aid to help find which locks have been acquired out of order, the I'd suggest that you find all the places in your code that you acquire any lock and just print a debugging message with the current time, the method calling acquire/release, the thread id, and the lock id that is being acquired. Also do the same thing with all the release calls. Then run your system until you get the deadlock and find in your log file which locks have been acquired by which threads and in which order. Then decide which thread is accessing it's locks in the wrong order and change it.