I have an int array[100]
and I want 5 threads to calculate the sum of all array elements.
Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum
variable.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
for(i=offset; i<offset+range; i++){
// not used pthread_mutex_lock(&mutex);
sum += array[i];
// not used pthread_mutex_unlock(&mutex);
}
Can this lead to unpredictable behavior or does the OS actually handle this?
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
Yes, you need synchronization, because all thread are modifying the sum
at the same time. Here's example:
You have array of 4 elements [a1, a2, a3, a4]
and 2 threads t1
and t2
and sum
. To begin let's say t1
get value a1
and adds it to sum
. But it's not an atomic operation, so he copy current value of sum
(it's 0) to his local space, let's call it t1_s
, adds to it a1
and then write sum = t1_s
. But at the same time t2
do the same, he get sum
value (which is 0, because t1
have not completed it operation) to t2_s
, adds a3
and write to sum
. So we got in the sum
value of a3
insted of a1 + a3
. This is called data race.
There are multiple solutions to this is:
- You can use
mutex
as you already did in your code, but as you mentioned it can be slow, since mutex locks are expensive and all other threads are waiting for it.
- Create array (with size of number of threads) to calculte local sums for all threads and then do the last reduction on this array in the one thread. No synchronization needed.
- Without array calculate local
sum_local
for each thread and in the end add all these sums to shared variable sum
using mutex. I guess it will be faster (however it need to be checked).
However as @gavinb mentioned all of it make sense only for larger amount of data.
I have an int array[100] and I want 5 threads to calculate the sum of all array elements. Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
First of all, it's worth pointing out that the overhead of this many threads processing this small amount of data would probably not be an advantage. There is a cost to creating threads, serialising access, and waiting for them to finish. With a dataset this small, an well-optimised sequential algorithm is probably faster. It would be an interesting exercise to measure the speedup with varying number of threads.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
Yes - the reading of the array
variable is independent, however updating the sum
variable is not, so you would need a mutex to serialise access to sum
, according to your description above.
However, this is a very inefficient way of calculating the sum, as each thread will be competing (and waiting, hence wasting time) for access to increment sum
. If you calculate intermediate sums for each subset (as @Werkov also mentioned), then wait for them to complete and add the intermediate sums to create the final sum, there will be no contention reading or writing, so you wouldn't need a mutex and each thread could run as quickly as possible. The limiting factor on performance would then likely be memory access pattern and cache behaviour.
Can this lead to unpredictable behavior or does the OS actually handle this?
Yes, definitely. The OS will not handle this for you as it cannot predict how/when you will access different parts of memory, and for what reason. Shared data must be protected between threads whenever any one of them may be writing to the data. So you would almost certainly get the wrong result as threads trip over each other updating sum
.
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
No, definitely not. It might run faster, but it will almost certainly not give you the correct result!
In the case where it is possible to partition data in such a way there aren't dependencies (i.e. reads/writes) across partitions. In your example, there is the dependency of the sum
variable and mutex is necessary. However, you can have partial sum accumulator for each thread and then only sum these sub results without need of a mutex.
Of course, you needn't to do this by hand. There are various implementations of this, for instance see OpenMP's parallel for and reduction.