Cuda: least square solving , poor in speed

Recently ,I use Cuda to write an algorithm called 'orthogonal matching pursuit' . In my ugly Cuda code the entire iteration takes 60 sec , and Eigen lib takes just 3 sec...

In my code Matrix A is [640,1024] and y is [640,1] , in each step I select some vectors from A to compose a new Matrix called A_temp [640,itera], iter=1:500 . I new a array MaxDex_Host[] in cpu to tell which column to select .

I want to get x_temp[itera,1] from A_temp*x_temp=y using least-square , I use a cula API 'culaDeviceSgels' and cublas matrix-vector multiplication API.

So the culaDeviceSgels would call 500 times , and I think this would be faster than Eigen lib's QR.Sovler .

I check the Nisight performence anlysis , I found the custreamdestory takes a long time . I initial cublas before iteration and destory it after I get the result . So I want to know the what is the custreamdestory , different with cublasdestory?

The main problem is memcpy and function 'gemm_kernel1x1val' . I think this function is from 'culaDeviceSgels'

while(itera<500): I use cublasSgemv and cublasIsamax to get MaxDex_Host[itera] , then

        MaxDex_Host[itera]=pos;
    itera++; 
    float* A_temp_cpu=new float[M*itera]; // matrix all in col-major
    for (int j=0;j<itera;j++) // to  get A_temp [M,itera] , the MaxDex_Host[] shows the positon of which column of A to chose , 
    {
        for (int i=0;i<M;i++) //M=640 , and A is 640*1024 ,itera is add 1 each step
        {
            A_temp_cpu[j*M+i]=A[MaxDex_Host[j]*M+i];
        }
    }
          // I must allocate one more array because culaDeviceSgels will decompose the one input Array ,  and I want to use A_temp after least-square solving.
    float* A_temp_gpu;
    float* A_temp2_gpu;  
    cudaMalloc((void**)&A_temp_gpu,Size_float*M*itera);
    cudaMalloc((void**)&A_temp2_gpu,Size_float*M*itera);
    cudaMemcpy(A_temp_gpu,A_temp_cpu,Size_float*M*itera,cudaMemcpyHostToDevice);
    cudaMemcpy(A_temp2_gpu,A_temp_gpu,Size_float*M*itera,cudaMemcpyDeviceToDevice);
    culaDeviceSgels('N',M,itera,1,A_temp_gpu,M,y_Gpu_temp,M);// the x_temp I want is in y_Gpu_temp's return value ,  stored in the y_Gpu_temp[0]——y_Gpu_temp[itera-1]
     float* x_temp;
    cudaMalloc((void**)&x_temp,Size_float*itera);
    cudaMemcpy(x_temp,y_Gpu_temp,Size_float*itera,cudaMemcpyDeviceToDevice);

Cuda's memory manage seems too complex , is there any other convenience method to solve least-square?

I think that custreamdestory and gemm_kernel1x1val are internally called by the APIs you are using, so there is not much to do with them.

To improve your code, I would suggest to do the following.

You can get rid of A_temp_cpu by keeping a device copy of the matrix A. Then you can copy the rows of A into the rows of A_temp_gpu and A_temp2_gpu by a kernel assignment. This would avoid performing the first two cudaMemcpys.
You can preallocate A_temp_gpu and A_temp2_gpu outside the while loop by using the maximum possible value of itera instead of itera. This will avoid the first two cudaMallocs inside the loop. The same applies to x_temp.
As long as I know, culaDeviceSgels solves a linear system of equations. I think you can do the same also by using cuBLAS APIs only. For example, you can perform an LU factorization first by cublasDgetrfBatched() and then use cublasStrsv() two times to solve the two arising linear systems. You may wish to see if this solution leads to a faster algorithm.