The following program is essentially the same as the one described here. When I compile and run the program with two threads (NTHREADS == 2), I get the following run times:
real 0m14.120s
user 0m25.570s
sys 0m0.050s
When it is run with just one thread (NTHREADS == 1), I get significantly better run times, even though it is only using one core:
real 0m4.705s
user 0m4.660s
sys 0m0.010s
My system is dual core, and I know random_r is thread-safe, and I am pretty sure it is non-blocking. When the same program is run without random_r, with a calculation of sines and cosines as a replacement, the dual-threaded version runs in about half the time, as expected.
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define NTHREADS 2
#define PRNG_BUFSZ 8
#define ITERATIONS 1000000000

void* thread_run(void* arg) {
    int r1, i, totalIterations = ITERATIONS / NTHREADS;
    for (i = 0; i < totalIterations; i++) {
        random_r((struct random_data*)arg, &r1);
    }
    printf("%i\n", r1);
    return NULL;
}

int main(int argc, char** argv) {
    /* calloc zeroes the random_data structs, as initstate_r requires */
    struct random_data* rand_states = (struct random_data*)calloc(NTHREADS, sizeof(struct random_data));
    char* rand_statebufs = (char*)calloc(NTHREADS, PRNG_BUFSZ);
    pthread_t* thread_ids;
    int t = 0;
    thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
    /* create threads */
    for (t = 0; t < NTHREADS; t++) {
        /* index by t * PRNG_BUFSZ so the state buffers don't overlap */
        initstate_r(random(), &rand_statebufs[t * PRNG_BUFSZ], PRNG_BUFSZ, &rand_states[t]);
        pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t]);
    }
    for (t = 0; t < NTHREADS; t++) {
        pthread_join(thread_ids[t], NULL);
    }
    free(thread_ids);
    free(rand_states);
    free(rand_statebufs);
    return 0;
}
I am confused why, when generating random numbers, the two-threaded version performs much worse than the single-threaded version, considering random_r is meant to be used in multi-threaded applications.
A very simple change that spaces each thread's data out in memory results in a much faster running time on my dual-core machine.
This confirms the suspicion it was meant to test: you are mutating values on the same cache line from two separate threads, and so have cache contention. Herb Sutter's talk 'Machine Architecture: Things Your Programming Language Never Told You' is worth watching if you have the time and don't know about this yet; he demonstrates false sharing starting at around 1:20.
Work out your cache line size, and create each thread's data so it is aligned to it.
It's a bit cleaner to plonk all the thread's data into a struct, and align that:
with CACHE_LINE_SIZE defined as 64.

Or you can use double the cache line size and allocate with malloc: the extra padding ensures the mutated memory lands on separate cache lines, as malloc returns memory aligned to 16 bytes (IIRC) rather than 64.
(I reduced ITERATIONS by a factor of ten, so the faster times are not down to a stupidly fast machine.)
I don't know if this is relevant or not, but I just saw very similar behaviour (an order of magnitude slower with two threads than with one). I basically changed a rand() call to a rand_r() call with a per-thread seed, and that "fixed" it: two threads are now reliably almost twice as fast (e.g. 19s instead of 35s). I don't know what the issue could have been; locking or cache coherence on the internals of rand(), maybe? Anyway, there is also a random_r(), so maybe that will be of use to you (a year ago) or someone else.