Parallelization of Jacobi algorithm using eigen c+

I have implemented the Jacobi algorithm based on the routine described in the book Numerical Recipes but since I plan to work with very large matrices I am trying to parallelize it using openmp.

void ROTATE(MatrixXd &a, int i, int j, int k, int l, double s, double tau)
{
double g,h;
g=a(i,j);
h=a(k,l);
a(i,j)=g-s*(h+g*tau);
a(k,l)=h+s*(g-h*tau);

}

void jacobi(int n, MatrixXd &a, MatrixXd &v, VectorXd &d )
{
int j,iq,ip,i;
double tresh,theta,tau,t,sm,s,h,g,c;

VectorXd b(n);
VectorXd z(n);

v.setIdentity();    
z.setZero();

#pragma omp parallel for 
for (ip=0;ip<n;ip++)
{   
    d(ip)=a(ip,ip);
    b(ip)=d(ip);
}

for (i=0;i<50;i++) 
{
    sm=0.0;
    for (ip=0;ip<n-1;ip++) 
    {
        #pragma omp parallel for reduction (+:sm)
        for (iq=ip+1;iq<n;iq++)
            sm += fabs(a(ip,iq));
    }
    if (sm == 0.0) {
        break;
    }

    if (i < 3)
    tresh=0.2*sm/(n*n); 
    else
    tresh=0.0;  

    #pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
    for (ip=0;ip<n-1;ip++)
    {
    //#pragma omp parallel for private (g,h,t,theta,c,s,tau)
        for (iq=ip+1;iq<n;iq++)
        {
            g=100.0*fabs(a(ip,iq));
            if (i > 3 && (fabs(d(ip))+g) == fabs(d[ip]) && (fabs(d[iq])+g) == fabs(d[iq]))
            a(ip,iq)=0.0;
            else if (fabs(a(ip,iq)) > tresh)
            {
                h=d(iq)-d(ip);
                if ((fabs(h)+g) == fabs(h))
                {
                    t=(a(ip,iq))/h;
                }   
                else 
                {
                    theta=0.5*h/(a(ip,iq));
                    t=1.0/(fabs(theta)+sqrt(1.0+theta*theta));
                    if (theta < 0.0)
                    {
                        t = -t;
                    }
                    c=1.0/sqrt(1+t*t);
                    s=t*c;
                    tau=s/(1.0+c);
                    h=t*a(ip,iq);

                   #pragma omp critical
                    {
                    z(ip)=z(ip)-h;
                    z(iq)=z(iq)+h;
                    d(ip)=d(ip)-h;
                    d(iq)=d(iq)+h;
                    a(ip,iq)=0.0;


                    for (j=0;j<ip;j++)
                        ROTATE(a,j,ip,j,iq,s,tau);
                    for (j=ip+1;j<iq;j++)
                        ROTATE(a,ip,j,j,iq,s,tau);
                    for (j=iq+1;j<n;j++)
                        ROTATE(a,ip,j,iq,j,s,tau);
                    for (j=0;j<n;j++)
                        ROTATE(v,j,ip,j,iq,s,tau);
                    }

                }
            } 
        }
    }


}

}

I wanted to parallelize the loop that does most of the calculations and both comments inserted in the code:

 //#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
 //#pragma omp parallel for private (g,h,t,theta,c,s,tau)

are my attempts at it. Unfortunately both of them end up producing incorrect results. I suspect the problem may be in this block:

z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;

because usually this sort of accumulation would need a reduction, but since each thread accesses a different part of the array, I am not certain of this.

I am not really sure if I am doing the parallelization in a correct manner because I have only recently started working with openmp, so any suggestion or recommendation would also be welcomed.

Sidenote: I know there are faster algorithms for eigenvalue and eigenvector determination including the SelfAdjointEigenSolver in Eigen, but those are not giving me the precision I need in the eigenvectors and this algorithm is.

My thanks in advance.

Edit: I considered to correct answer to be the one provided by The Quantum Physicist because what I did does not reduce the computation time for system of size up to 4096x4096. In any case I corrected the code in order to make it work and maybe for big enough systems it could be of some use. I would advise the use of timers to test if the

#pragma omp for

actually decrease the computation time.

I'll try to help, but I'm not sure this is the answer to your question.

There are tons of problems with your code. My friendly advice for you is: Don't do parallel things if you don't understand the implications of what you're doing.

For some reason, it looks like that you think putting everything in parallel #pragma for will make it faster. This is VERY wrong. Because spawning threads is an expensive thing to do and costs (relatively) lots of memory and time. So if you redo that #pragma for for every loop, you'll respawn threads for every loop, which will significantly reduce the speed of your program... UNLESS: Your matrices are REALLY huge and the computation time is >> than the cost of spawning them.

I fell into a similar issue when I wanted to multiply huge matrices, element-wise (and then I needed the sum for some expectation value in quantum mechanics). To use OpenMP for that, I had to flatten the matrices to linear arrays, and then distribute the array chunks each to a thread, and then run a for loop, where every loop iteration uses elements that are independent of others for sure, and I made them all evolve independently. This was quite fast. Why? Because I never had to respawn threads twice.

Why you're getting wrong results? I believe the reason is because you're not respecting shared memory rules. You have some variable(s) that is being modified by multiple threads simultaneously. It's hiding somewhere, and you have to find it! For example, what does the function z do? Does it take stuff by reference? What I see here:

z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;

Looks VERY multi-threading not-safe, and I don't understand what you're doing. Are you returning a reference that you have to modify? This is a recipe for thread non-safety. Why don't you create clean arrays and deal with them instead of this?

How to debug: Start with a small example (2x2 matrix, maybe), and use only 2 threads, and try to understand what's going on. Use a debugger and define break points, and check what information is shared between threads.

Also consider using a mutex to check what data gets ruined when it becomes shared. Here is how to do it.

My recommendation: Don't use OpenMP unless you plan to spawn the threads ONLY ONCE. I actually believe that OpenMP is going to die very soon because of C++11. OpenMP was beautiful back then when C++ didn't have any native multi-threading implementation. So learn how to use std::thread, and use it, and if you need to run many things in threads, then learn how to create a Thread Pool with std::thread. This is a good book for learning multithreading.