Determinant calculation with CUDA [closed]

2019-06-24 00:29发布

问题:

Is there any library or freely available code which will calculate the determinant of a small (6x6), double precision matrix entirely on a GPU?

回答1:

Here is the plan, you will need to buffer 100s of these tiny matrices and launch the kernel once to compute the determinant for all of them at once.

I am not going to write actual code, but this should be helpful.

1) Launch # blocks = # matrices. Each block calculates determinant of each matrix.

2) det(A) = det(A11 * A22 - A21 * A12); where A is 6x6, A11, A12, A21, A22 are 3x3 sub matrices of A.

3) Write a device function that does matrix multiply for 3x3 matrices

4) det of a 3x3 matrix is simple to calculate: use the formula from here.

EDIT: Apparently (2) only works if A21 * A12 == A12 * A21

An alternative would be the following

1) LU Decomposition by Gaussian elimination for each 6x6 matrix

2) Multiply the diagonal elements of U to get determinant.



回答2:

As it has been already pointed out especially by Bart in the comments above, using GPUs to calculate the calculation of the determinant of small matrices (even of many of them) does not ensure gain over other computing platforms.

I think that the problem of calculating the determinant of matrices is per sé an interesting issue which may occur several times in the applications. Currently, I'm not aware of any library providing routines for determinant calculation with CUDA (neither cuBLAS nor cuSOLVER do have a such routine), so you have two possibilities:

  1. Implementing your own approach, as pointed out by Pavan;
  2. Manage to use other available routines.

Concerning the last point, one possibility is to use the Cholesky decomposition and then calculate the determinant as the squared of the product of the diagonal elements of the Cholesky matrix. In Matlab, it would read:

prod(diag(chol(A)))^2 

Below, I'm providing a code exploiting this idea. In particular, the Cholesky decomposition is performed by using the cuSOLVER's potrf function, while the product of the elements on the diagonal of the Cholesky matrix is an application of strided reduction with Thrust.

The code below is though for large matrices, so it will be immediately useful for people needing to calculate the determinant of large matrices. But how to adapt it to several small matrices? One possibility is to use cuSOLVER's streams for the Cholesky decomposition and then using the new dynamic parallelism feature of Thurst 1.8. Note that, as of CUDA 7.0, cuSOLVER does not allow the use of dynamic parallelism.

Here is the code:

#include "cuda_runtime.h"
#include "device_launch_paraMeters.h"

#include<iostream>
#include<iomanip>
#include<stdlib.h>
#include<stdio.h>
#include<assert.h>
#include<ostream>

#include <cusolverDn.h>
#include <cublas_v2.h>
#include <cuda_runtime_api.h>

#include "Utilities.cuh"

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/functional.h>

#include <thrust/fill.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#include <thrust/copy.h>

/*************************/
/* STRIDED RANGE FUNCTOR */
/*************************/
template <typename Iterator>
class strided_range
{
    public:

    typedef typename thrust::iterator_difference<Iterator>::type difference_type;

    struct stride_functor : public thrust::unary_function<difference_type,difference_type>
    {
        difference_type stride;

        stride_functor(difference_type stride)
            : stride(stride) {}

        __host__ __device__
        difference_type operator()(const difference_type& i) const
        {
            return stride * i;
        }
    };

    typedef typename thrust::counting_iterator<difference_type>                   CountingIterator;
    typedef typename thrust::transform_iterator<stride_functor, CountingIterator> TransformIterator;
    typedef typename thrust::permutation_iterator<Iterator,TransformIterator>     PermutationIterator;

    // type of the strided_range iterator
    typedef PermutationIterator iterator;

    // construct strided_range for the range [first,last)
    strided_range(Iterator first, Iterator last, difference_type stride)
        : first(first), last(last), stride(stride) {}

    iterator begin(void) const
    {
        return PermutationIterator(first, TransformIterator(CountingIterator(0), stride_functor(stride)));
    }

    iterator end(void) const
    {
        return begin() + ((last - first) + (stride - 1)) / stride;
    }

    protected:
    Iterator first;
    Iterator last;
    difference_type stride;
};

int main(void)
{
    const int Nrows = 5;
    const int Ncols = 5;

    const int STRIDE = Nrows + 1;

    // --- Setting the host, Nrows x Ncols matrix
    double h_A[Nrows][Ncols] = { 
        { 2.,    -2.,    -2.,    -2.,    -2.,},  
        {-2.,     4.,     0.,     0.,     0.,}, 
        {-2.,     0.,     6.,     2.,     2.,}, 
        {-2.,     0.,     2.,     8.,     4.,}, 
        {-2.,     0.,     2.,     4.,     10.,}
    };

    // --- Setting the device matrix and moving the host matrix to the device
    double *d_A;            gpuErrchk(cudaMalloc(&d_A,      Nrows * Ncols * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_A, h_A, Nrows * Ncols * sizeof(double), cudaMemcpyHostToDevice));

    // --- cuSOLVE input/output parameters/arrays
    int work_size = 0;
    int *devInfo;           gpuErrchk(cudaMalloc(&devInfo,          sizeof(int)));

    // --- CUDA solver initialization
    cusolverDnHandle_t solver_handle;
    cusolverDnCreate(&solver_handle);

    // --- CUDA CHOLESKY initialization
    cusolveSafeCall(cusolverDnDpotrf_bufferSize(solver_handle, CUBLAS_FILL_MODE_LOWER, Nrows, d_A, Nrows, &work_size));

    // --- CUDA POTRF execution
    double *work;   gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));
    cusolveSafeCall(cusolverDnDpotrf(solver_handle, CUBLAS_FILL_MODE_LOWER, Nrows, d_A, Nrows, work, work_size, devInfo));
    int devInfo_h = 0;  gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
    if (devInfo_h != 0) std::cout   << "Unsuccessful potrf execution\n\n";

    cusolverDnDestroy(solver_handle);

    // --- Strided reduction of the elements of d_A: calculating the product of the diagonal of the Cholesky factorization  
    thrust::device_ptr<double> dev_ptr = thrust::device_pointer_cast(d_A);
    typedef thrust::device_vector<double>::iterator Iterator;
    strided_range<Iterator> pos(dev_ptr, dev_ptr + Nrows * Ncols, STRIDE);

    double det = thrust::reduce(pos.begin(), pos.end(), 1., thrust::multiplies<double>());
    det  = det * det;

    printf("determinant = %f\n", det);

    return 0;
}


回答3:

You could use OpenCL or CUDA as libary and wrote a short program (kernel in OpenCL) which calculates the determinant on GPU

CUDA http://www.nvidia.de/object/cuda_home_new_de.html

OpenCL http://www.khronos.org/opencl/

http://www.csd.uwo.ca/~moreno/Publications/DetHpcsPaper-proceedings.pdf

This paper should contain pseudo code for CUDA