Large matrix multiplication on GPU

Posted 2020-03-17 04:46

I need to implement matrix multiplication on a GPU with CUDA for large matrices. Each matrix alone is bigger than the GPU memory, so I think I need an algorithm to do this efficiently. I searched the internet but couldn't find one. Can anyone give me the name of, or a link to, such an algorithm?

Thank you

1 Answer
Anthone · 2020-03-17 05:20

There isn't really a formal algorithm for this; in general, these sorts of linear algebra operations where the whole problem isn't stored in memory simultaneously are referred to as "out-of-core" operations.

To solve it, you don't need a particularly elaborate algorithm, just the CUBLAS library and a pencil and paper. For example, you can decompose the matrix product like this:

$$\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A_1 B_1 & A_1 B_2 \\ A_2 B_1 & A_2 B_2 \end{pmatrix}$$

which gives you four independent sub-matrix multiplication operations. These can be calculated with four calls to CUBLAS gemm and very straightforward host code. You can extend the idea to as many sub-matrices as are required to match the problem size and your GPU capacity. The same principle can also be used to implement matrix multiplication on multiple GPUs (see this question for an example).
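For illustration, here is a minimal host-code sketch of the 2×2 decomposition above using cublasSgemm. It assumes column-major storage, single precision, even M and N, and that each half-size tile fits in GPU memory; error checking is omitted and the function name `tiled_sgemm` is purely illustrative:

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* C (MxN) = A (MxK) * B (KxN), all column-major on the host.
   A is split into two row blocks A1/A2 and B into two column
   blocks B1/B2, giving the four independent GEMMs shown above. */
void tiled_sgemm(int M, int N, int K,
                 const float *A, const float *B, float *C)
{
    const int Mh = M / 2, Nh = N / 2;   /* tile dimensions */
    const float alpha = 1.0f, beta = 0.0f;

    cublasHandle_t handle;
    cublasCreate(&handle);

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, sizeof(float) * Mh * K);
    cudaMalloc((void **)&dB, sizeof(float) * K * Nh);
    cudaMalloc((void **)&dC, sizeof(float) * Mh * Nh);

    for (int i = 0; i < 2; ++i) {
        /* copy row block A_i (Mh x K); host leading dimension is M */
        cublasSetMatrix(Mh, K, sizeof(float), A + i * Mh, M, dA, Mh);
        for (int j = 0; j < 2; ++j) {
            /* copy column block B_j (K x Nh); host leading dimension is K */
            cublasSetMatrix(K, Nh, sizeof(float), B + j * Nh * K, K, dB, K);

            /* C_ij = A_i * B_j */
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        Mh, Nh, K, &alpha, dA, Mh, dB, K, &beta, dC, Mh);

            /* write C_ij back into the full host matrix */
            cublasGetMatrix(Mh, Nh, sizeof(float),
                            dC, Mh, C + j * Nh * M + i * Mh, M);
        }
    }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}
```

Note that each A_i tile is copied once and reused across the inner loop; a more polished version would also overlap the host-device transfers with compute using CUDA streams and pinned host memory, but the four gemm calls are the whole algorithm.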

Alternatively, you can find a working implementation of this precise idea in the Harvard-developed SciGPU-GEMM codebase and in the HPL-CUDA Linpack implementation (disclaimer: I am affiliated with the latter codebase).
