I wrote a conjugate-gradient solver (for linear systems of equations) with LU preconditioning, using Dr. Maxim Naumov's papers on NVIDIA's research community site as a guideline. The residual update step, which requires solving a lower triangular system and then an upper triangular system, is divided into two phases:
- the analysis phase (which exploits the sparsity pattern and decides the parallelization level), and
- the solve phase itself.
According to all posts related to this topic, and to Naumov's paper itself, the analysis phase is significantly slower than the solve phase, but it is executed only once, so it is not supposed to be an issue when considering the whole execution time. In my program, however, the analysis phase takes ~35-45% of the whole solution time (the time needed to perform all iterations!), which is quite annoying. Another thing: I'm fairly sure the matrix's sparsity pattern doesn't allow much parallelization, since nearly every element requires previous elements to be known (in CFD each node needs at least the 6 adjacent nodes around it (hexahedral volumes), and that repeats on every row), so the analysis phase won't be very useful anyway!
matrixLU in this code contains both triangular factors: the upper triangular factor uses the original matrix's diagonal, and the lower triangular factor has a unit diagonal (LU factorization).
// z = inv(matrixLU)*r : set up the L and U triangular solves (legacy cuSPARSE csrsv API)
cusparseMatDescr_t descrL = 0;
cusparseMatDescr_t descrU = 0;
cusparseStatus = cusparseCreateMatDescr(&descrL);
cusparseStatus = cusparseCreateMatDescr(&descrU);

// L: lower triangular with an implicit unit diagonal
cusparseSetMatType(descrL, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_UNIT);
cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);

// U: upper triangular, diagonal stored explicitly
cusparseSetMatType(descrU, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatDiagType(descrU, CUSPARSE_DIAG_TYPE_NON_UNIT);
cusparseSetMatFillMode(descrU, CUSPARSE_FILL_MODE_UPPER);

cusparseSolveAnalysisInfo_t inforL = 0;
cusparseSolveAnalysisInfo_t inforU = 0;
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforL);
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforU);

// time the analysis phase (executed once, reused by every solve call)
cudaDeviceSynchronize();   // make sure pending GPU work isn't counted in the timing
double startFA = omp_get_wtime();
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                         N, NZ, descrL, matrixLU, iRow, jCol, inforL);
if (cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("cusparseDcsrsv_analysis (L) error!\n\n");
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                         N, NZ, descrU, matrixLU, iRow, jCol, inforU);
if (cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("cusparseDcsrsv_analysis (U) error!\n\n");
double finishFA = omp_get_wtime();
So, does anybody have a clue why the analysis phase is so slow, and how it might be accelerated? Is the (analysis phase execution time)/(solve phase execution time) ratio GPU dependent?
Edit: I tried this solver on many cases and the results were similar, but in the specific case I'm concerned about, the conditions are the following:
- Size (N) : ~860,000 * 860,000
- Number of non zeros (NZ): ~6,000,000
- Number of iterations required for convergence: 10
- Analysis phase execution time: 210 ms
- Solve phase execution time (summing up all iterations): 350 ms
- All floating point operations are performed in double precision format
- GPU: GeForce GTX 550 Ti
- OS: Windows 7 ultimate, 64 bit
I contacted our linear algebra libraries team at NVIDIA and they provided the following feedback.
With respect to the performance results:
The CG iterative method performs only 10 iterations until the solution of the linear system is found. It seems that in this case this is not enough iterations to amortize the time consumed by the analysis phase. (Edit:) The number of steps required to converge to the solution varies across different applications. In some cases unpreconditioned iterative methods do not converge (or take thousands of iterations), while preconditioned iterative methods take tens (or hundreds) of iterations to converge to a solution. The fact that in this particular application the iterative method converges in 10 iterations is not necessarily representative of all applications.
The ratio of the analysis phase to the per-iteration solve phase is 210/35 = 6, which is in line with our experiments on other matrices, where it usually falls in the interval [4,10]. It is not clear how the time for the analysis and solve phases compares to the time taken by the sparse triangular solve on the CPU. (This would be interesting information to know.)
Finally, you also mentioned that you think there is not enough parallelism in the problem. But the amount of parallelism can be hard to guess just from looking at the matrix. If indeed there is little parallelism (every row depends on the previous row), then the CUSPARSE sparse triangular solve might not be the right algorithm for this problem. Also, you can try to run the CG iterative method without preconditioning or with other types of preconditioners that might be more suitable to your problem.
As mentioned earlier, it would be interesting to know how the performance of this code compares on the GPU and CPU.
Edit: Regarding some comments... As mentioned earlier, the ultimate number of iterations for a preconditioned iterative method varies across different applications. In some cases it can be very small (for example, 10 as in your application), but in others it can be relatively large (measured in the 100s). The paper focuses not on the "triumph" of the algorithm, but rather on its tradeoffs. It attempts to give the user a deeper understanding of the routine so that it can be used effectively. Again, if the sparsity pattern at hand does not have any parallelism, this is not the right algorithm for you.
With respect to the CPU and GPU comparisons, there are certain problems (such as dense matrix-matrix or sparse matrix-vector multiplication) for which GPU is great, there are others (such as traversing a linked list or executing a completely sequential piece of code) for which it is not. The solution of sparse triangular linear systems lies somewhere in between those algorithms (depending on the sparsity pattern of the coefficient matrix and other properties of the linear system it might or not work well for you). Also, the decision to use the GPU is not based solely on a single algorithm, such as sparse triangular solve, but on the entire set of algorithms used by the application. Ultimately, GPUs are often very successful in accelerating and improving the performance of the overall application.
Finally, with respect to trsm vs. csrsv, it does not surprise us that for small matrices performing operations in dense storage is faster than in sparse storage (even though these small matrices might be sparse). This would usually be true not only for triangular solve, but for other linear algebra operations as well (though the crossover point can differ depending on the operation and the sparsity pattern of the matrix). Also, once again, the sparse triangular solve algorithm was designed to run in an iterative setting where the analysis is done once and the solve is done multiple times. So it is not surprising that running it once (with the analysis phase counted in) raises the crossover point at which this operation becomes more effective in sparse rather than dense format.