可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have an int array[100] and I want 5 threads to calculate the sum of all array elements.

Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.

Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.

for(i=offset; i<offset+range; i++){
  // not used pthread_mutex_lock(&mutex);
  sum += array[i];
  // not used pthread_mutex_unlock(&mutex);
}

Can this lead to unpredictable behavior or does the OS actually handle this?

Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.

回答1:

Yes, you need synchronization, because all thread are modifying the sum at the same time. Here's example:

You have array of 4 elements [a1, a2, a3, a4] and 2 threads t1 and t2 and sum. To begin let's say t1 get value a1 and adds it to sum. But it's not an atomic operation, so he copy current value of sum (it's 0) to his local space, let's call it t1_s, adds to it a1 and then write sum = t1_s. But at the same time t2 do the same, he get sum value (which is 0, because t1 have not completed it operation) to t2_s, adds a3 and write to sum. So we got in the sum value of a3 insted of a1 + a3. This is called data race.

There are multiple solutions to this is:

You can use mutex as you already did in your code, but as you mentioned it can be slow, since mutex locks are expensive and all other threads are waiting for it.
Create array (with size of number of threads) to calculte local sums for all threads and then do the last reduction on this array in the one thread. No synchronization needed.
Without array calculate local sum_local for each thread and in the end add all these sums to shared variable sum using mutex. I guess it will be faster (however it need to be checked).

However as @gavinb mentioned all of it make sense only for larger amount of data.

回答2:

I have an int array[100] and I want 5 threads to calculate the sum of all array elements. Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.

First of all, it's worth pointing out that the overhead of this many threads processing this small amount of data would probably not be an advantage. There is a cost to creating threads, serialising access, and waiting for them to finish. With a dataset this small, an well-optimised sequential algorithm is probably faster. It would be an interesting exercise to measure the speedup with varying number of threads.

Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.

Yes - the reading of the array variable is independent, however updating the sum variable is not, so you would need a mutex to serialise access to sum, according to your description above.

However, this is a very inefficient way of calculating the sum, as each thread will be competing (and waiting, hence wasting time) for access to increment sum. If you calculate intermediate sums for each subset (as @Werkov also mentioned), then wait for them to complete and add the intermediate sums to create the final sum, there will be no contention reading or writing, so you wouldn't need a mutex and each thread could run as quickly as possible. The limiting factor on performance would then likely be memory access pattern and cache behaviour.

Can this lead to unpredictable behavior or does the OS actually handle this?

Yes, definitely. The OS will not handle this for you as it cannot predict how/when you will access different parts of memory, and for what reason. Shared data must be protected between threads whenever any one of them may be writing to the data. So you would almost certainly get the wrong result as threads trip over each other updating sum.

Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.

No, definitely not. It might run faster, but it will almost certainly not give you the correct result!

回答3:

In the case where it is possible to partition data in such a way there aren't dependencies (i.e. reads/writes) across partitions. In your example, there is the dependency of the sum variable and mutex is necessary. However, you can have partial sum accumulator for each thread and then only sum these sub results without need of a mutex.

Of course, you needn't to do this by hand. There are various implementations of this, for instance see OpenMP's parallel for and reduction.