OpenMP and GSL RNG - Performance Issue - 4 threads

I am trying to turn a C project of mine from sequential into parallel programming. Although most of the code has now been redesigned from scratch for this purpose, the generation of random numbers is still at its core. Thus, bad performance of the random number generator (RNG) affects very badly the overall performance of the program.

I wrote some code lines (see below) to show the problem I am facing without much verbosity.

The problem is the following: everytime the number of threads nt increases, the performance gets singnificantly worse. At this workstation (linux kernel 2.6.33.4; gcc 4.4.4; intel quadcore CPU) the parallel for-loop takes roughly 10x longer to finish with nt=4 than with nt=1, regardless of the number of iterates n.

This situation seems to be described here but the focus is mainly in fortran, a language I know very little about, so I would very much appreciate some help.

I tried to follow their idea of creating different RNG (with a different seed) to be accessed by each thread but the performance is still very bad. Actually, this different seeding point for each thread bugs me as well, because I cannot see how it is possible for one to guarantee the quality of the generated numbers in the end (lack of correlations, etc).

I have already thought of dropping GSL altogether and implementing a random generator algorithm (such as Mersenne-Twister) myself but I suspect I would just bump into the same issue later on.

Thank you very much in advance for your answers and advice. Please do ask anything important I may have forgotten to mention.

EDIT: Implemented corrections suggested by lucas1024 (pragma for-loop declaration) and JonathanDursi (seeding; setting "a" as a private variable). Performance is still very sluggish in multithread-mode.

EDIT 2: Implemented solution suggested by Jonathan Dursi (see comments).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <gsl/gsl_rng.h>
    #include <omp.h>

    double d_t (struct timespec t1, struct timespec t2){

        return (t2.tv_sec-t1.tv_sec)+(double)(t2.tv_nsec-t1.tv_nsec)/1000000000.0;
    }

    int main (int argc, char *argv[]){

        double a, b;

        int i,j,k;

        int n=atoi(argv[1]), seed=atoi(argv[2]), nt=atoi(argv[3]);

        printf("\nn\t= %d", n);
        printf("\nseed\t= %d", seed);
        printf("\nnt\t= %d", nt);

        struct timespec t1, t2, t3, t4;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

        //initialize gsl random number generator
        const gsl_rng_type *rng_t;
        gsl_rng **rng;
        gsl_rng_env_setup();
        rng_t = gsl_rng_default;

        rng = (gsl_rng **) malloc(nt * sizeof(gsl_rng *));

            #pragma omp parallel for num_threads(nt)
        for(i=0;i<nt;i++){
            rng[i] = gsl_rng_alloc (rng_t);
            gsl_rng_set(rng[i],seed*i);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t2);

        for (i=0;i<n;i++){
            a = gsl_rng_uniform(rng[0]);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t3);

        omp_set_num_threads(nt);
        #pragma omp parallel private(j,a)
        {
            j = omp_get_thread_num();
            #pragma omp for
            for(i=0;i<n;i++){
                a = gsl_rng_uniform(rng[j]);
            }
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t4);

        printf("\n\ninitializing:\t\tt1 = %f seconds", d_t(t1,t2));
        printf("\nsequencial for loop:\tt2 = %f seconds", d_t(t2,t3));
        printf("\nparalel for loop:\tt3 = %f seconds (%f * t2)", d_t(t3,t4), (double)d_t(t3,t4)/(double)d_t(t2,t3));
        printf("\nnumber of threads:\tnt = %d\n", nt);

        //free random number generator
        for (i=0;i<nt;i++)
            gsl_rng_free(rng[i]);
        free(rng);

        return 0;

    }

The problem is in the second #pragma omp line. The first #pragma omp spawns 4 threads. After that you are supposed to simply say #pragma omp for - not #pragma omp parallel for.

With the current code, depending on your omp nesting settings, you are creating 4 x 4 threads that are doing the same work and accessing the same data.