I am following up on this original post: PyCuda code to invert a high number of 3x3 matrixes. The code suggested as an answer is:
$ cat t14.py
import numpy as np
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
# kernel
kernel = SourceModule("""
// strip off and return the low hex digit (4 bits) of off; off is shifted for the next call
__device__ unsigned getoff(unsigned &off){
unsigned ret = off & 0x0F;
off >>= 4;
return ret;
}
// in-place is acceptable (i.e. out == in)
// T = float or double only
const int block_size = 288;
typedef double T; // *** can set to float or double
__global__ void inv3x3(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){
__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
T det = 1;
if (idx < n*9)
det = in[idx];
unsigned sibase = (threadIdx.x / 9)*9; // base offset of this thread's matrix in shared memory
unsigned lane = threadIdx.x - sibase; // cheaper modulo: this thread's index (0..8) within its matrix
si[threadIdx.x] = det;
__syncthreads();
unsigned off = pat[lane];
T a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
T b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
a -= b;
__syncthreads();
if (lane == 0) si[sibase+3] = a;
if (lane == 3) si[sibase+4] = a;
if (lane == 6) si[sibase+5] = a;
__syncthreads();
// det(A) = a(ei-fh) + b(fg-di) + c(dh-eg), computed redundantly by every thread
det = si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
if (idx < n*9)
out[idx] = a / det;
}
""")
# host code
def gpuinv3x3(inp, n):
    # internal constants not to be modified
    hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032, 0x06473, 0x07061, 0x03140)
    # Convert parameters into numpy array
    # *** change next line between float32 and float64 to match float or double
    inpd = np.array(inp, dtype=np.float64)
    hpatd = np.array(hpat, dtype=np.uint32)
    # *** change next line between float32 and float64 to match float or double
    output = np.empty((n*9), dtype=np.float64)
    # Get kernel function
    matinv3x3 = kernel.get_function("inv3x3")
    # Define block, grid and compute
    blockDim = (288,1,1) # do not change
    gridDim = ((n//32)+1,1,1) # integer division, so the grid dimension is an int
    # Kernel function
    matinv3x3(
        cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
        block=blockDim, grid=gridDim)
    return output
inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0)
n = 2
result = gpuinv3x3(inp, n)
print(result.reshape(2,3,3))
On an initial 1D array containing 18 values (i.e. two 3x3 matrices), the result gives the correctly inverted matrices, i.e.:
[[[ 2. -0. -1. ]
[-1. -0.33333333 1. ]
[-0. 0.33333333 -0. ]]
[[ 1. 0. 0. ]
[ 0. 1. 0. ]
[ 0. 0. 1. ]]]
Main issue: I would like to understand in detail how this algorithm works, especially how the kernel uses shared memory for the initial 1D vector, and how this then brings an optimization when I execute this code on a large number of 3x3 matrices.
1) I understand the line: size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
which gives the global index of the current work-item, identified by the local threadIdx and the blockIdx of the current work-group (block).
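As a worked instance of this arithmetic (using the 288-thread block size from the code above, and the one-thread-per-matrix-element layout explained in the answer below):
// blockDim.x = 288
// for thread 5 of block 2:  idx = 5 + 288*2 = 581
// element 581 belongs to matrix 581/9 = 64 and is element 581%9 = 5 of that matrix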
2) I understand that __shared__ T si[block_size];
represents a shared array, i.e. one associated with each work-group (block): this is what OpenCL calls Local Memory.
3) On the other hand, I don't understand the following part of the kernel code:
__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
T det = 1;
if (idx < n*9)
det = in[idx];
unsigned sibase = (threadIdx.x / 9)*9;
unsigned lane = threadIdx.x - sibase; // cheaper modulo
si[threadIdx.x] = det;
__syncthreads();
unsigned off = pat[lane];
T a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
T b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
a -= b;
__syncthreads();
if (lane == 0) si[sibase+3] = a;
if (lane == 3) si[sibase+4] = a;
if (lane == 6) si[sibase+5] = a;
__syncthreads();
Indeed, what is the role of the sibase index defined by unsigned sibase = (threadIdx.x / 9)*9;
and what is the purpose of the lane variable defined by unsigned lane = threadIdx.x - sibase; // cheaper modulo ?
Finally, shifts are applied with:
T a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
T b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
a -= b;
But I don't clearly see what this accomplishes.
4) I have the same problem with this part:
if (lane == 0) si[sibase+3] = a;
if (lane == 3) si[sibase+4] = a;
if (lane == 6) si[sibase+5] = a;
5) The determinant is calculated in a strange way that I can't grasp, i.e.:
det = si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
If someone could explain all these different steps in a pedagogical way, that would be nice.
I am not a beginner in OpenCL, but I am not expert enough to fully understand this kernel code.
Preliminaries
First, it's important to understand the arithmetic of a 3x3 matrix inverse (see below).
The general methodology used for kernel design is to assign one matrix result element per thread. Therefore I will need 9 threads per matrix. Ultimately each thread will be responsible for computing one of the 9 numerical results for each matrix. To compute two matrices we then need 18 threads, and 3 matrices require 27 threads.
An ancillary task is to decide threadblock/grid sizing. This follows typical methods (overall problem size determines total number of threads needed), but we will make a specific choice of 288 for threadblock size, as this is a convenient multiple of both 9 (number of threads per matrix) and 32 (number of threads per warp in CUDA), which gives us a certain measure of efficiency (no wasted threads, no gaps in data storage).
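To make this sizing concrete, here is the arithmetic, matching the host code's gridDim computation:
// threads per matrix:      9
// threads per block:       288 = 9 * 32   (a multiple of 9 and of the 32-thread warp)
// matrices per block:      288 / 9 = 32
// blocks for n matrices:   gridDim.x = (n/32) + 1   (host code: (n//32)+1)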
Since our thread strategy is one thread per matrix element, we must collectively solve the matrix inversion arithmetic using 9 threads. The major tasks are to compute the transposed matrix of cofactors, and then to compute the determinant, then do the final arithmetic (divide by the determinant) to compute each result element.
Computation of the cofactors
The first task is to compute the transposed matrix of cofactors of A, called M:
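       | a  b  c |
   A = | d  e  f |
       | g  h  i |

       | ei-fh  ch-bi  bf-ce |
   M = | fg-di  ai-cg  cd-af |
       | dh-eg  bg-ah  ae-bd |
(This M is reconstructed here from the standard 3x3 adjugate arithmetic, i.e. the transpose of the cofactor matrix, so that A^-1 = M / det(A). It is consistent with the cofactors named below: thread 0's ei-fh, thread 3's fg-di, thread 6's dh-eg.)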
We have 9 threads for this task, and nine elements of matrix M to compute, so we will assign one thread to each element of M. Each element of M depends on multiple input values (a, b, c, etc.), so we shall first load each input value (there are 9, one per thread) into shared memory:
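__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
T det = 1;
if (idx < n*9)
  det = in[idx];
unsigned sibase = (threadIdx.x / 9)*9;  // base offset of this thread's matrix in shared memory
unsigned lane = threadIdx.x - sibase;   // cheaper modulo: this thread's index (0..8) within its matrix
si[threadIdx.x] = det;
__syncthreads();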
Now that each A matrix element (a, b, c, ...) is loaded into shared memory, we can start to compute the cofactors in M. Let's focus on a particular thread (thread 0) and its cofactor (ei-fh). All of the matrix elements needed to compute this cofactor (e, i, f, and h) are now in shared memory. We need a method to load them in sequence, and perform the needed multiplications and subtraction.
At this point we observe two things:

1. each M element (cofactor) has a different set of 4 needed elements of A;
2. each M element (cofactor) follows the same general arithmetic, given four arbitrary elements of A; let's refer to them generically as X, Y, Z and W. The arithmetic is XY-ZW: I take the first element, multiply it by the second, then take the third and fourth elements and multiply them together, then subtract the two products.

Since the general sequence of operations (2, above) is the same for all 9 cofactors, we only need a method to arrange the loading of the 4 needed matrix elements. This methodology is encoded into the load patterns that are hard-coded into the example.
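These are the hpat values from the host code above:
hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032, 0x06473, 0x07061, 0x03140)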
There are 9 load patterns, each occupying a hexadecimal quantity, one load pattern per thread, i.e. one load pattern per M matrix element (cofactor). Within a particular A matrix, the matrix elements a, b, c, etc. are (already) loaded into shared memory at group offsets of 0, 1, 2, etc. The load pattern for a given thread allows us to generate the sequence of group offsets needed to retrieve the matrix elements of A from their locations in shared memory, to be used in sequence to compute the cofactor assigned to that thread. Considering thread 0 and its cofactor ei-fh, how does the load pattern 0x7584 encode the needed pattern to select e, then i, then f, then h?
For this we have a helper function getoff, which takes a load pattern and successively (each time it is called) strips off an index. The first time I call getoff with an argument of 0x7584, it "strips off" the index 4, returns that, and replaces the 0x7584 load pattern with 0x758 for the next usage. 4 corresponds to e. The next time I call getoff with 0x758, it "strips off" the index 8, returns that, and replaces 0x758 with 0x75. 8 corresponds to i. The next time produces the index 5, corresponding to f, and the last time produces the index 7, corresponding to h.
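Written as a trace (the index-to-element correspondence follows from the storage order a=0, b=1, ..., i=8):
off = 0x7584
getoff(off) -> 4  (e), off becomes 0x758
getoff(off) -> 8  (i), off becomes 0x75
getoff(off) -> 5  (f), off becomes 0x7
getoff(off) -> 7  (h)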
With that description, we will now walk through the code, pretending we are thread 0, and describe the process of computing ei-fh:
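unsigned off = pat[lane];        // lane 0 reads the load pattern 0x07584
T a = si[sibase + getoff(off)];  // a = e
a *= si[sibase + getoff(off)];   // a = e*i
T b = si[sibase + getoff(off)];  // b = f
b *= si[sibase + getoff(off)];   // b = f*h
a -= b;                          // a = ei - fh, thread 0's cofactor
(The annotations for thread 0 are mine; every other lane runs the same lines with its own pattern.)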
sibase, as already indicated in the first commented code section, is the base offset in shared memory where that A matrix's elements are stored. The getoff function then adds to this base address to select the relevant input element.
Computation of the determinant
The numerical value of the determinant is given by:
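det(A) = a(ei-fh) - b(di-fg) + c(dh-eg)
       = a(ei-fh) + b(fg-di) + c(dh-eg)
(This is the standard first-row expansion; the second form just regroups signs so that each term matches a cofactor computed above.)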
If we decompose this, we see that all terms are actually already computed:
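Concretely, from the storage layout and lane assignments above:
a, b, c  ->  still in shared memory at group offsets 0, 1, 2
ei-fh    ->  cofactor already computed by thread 0 (lane 0)
fg-di    ->  cofactor already computed by thread 3 (lane 3)
dh-eg    ->  cofactor already computed by thread 6 (lane 6)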
Now, every thread will need the value of the determinant, because it will be used by each thread during the computation of its final (result) element. Therefore we will have every thread in the matrix redundantly compute the same value (which is more efficient than computing it, say, in one thread, then broadcasting that value to the other threads). In order to facilitate this, we will need 3 of the already computed cofactors made available to all 9 threads. So we will select 3 (no longer needed) locations in shared memory to "publish" these values. We still need the values in locations 0, 1, 2 because we need the input matrix elements a, b, and c for calculation of the determinant. But we no longer need the input elements in locations 3, 4, or 5 for the remainder of our work, so we will reuse those:
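__syncthreads();
if (lane == 0) si[sibase+3] = a;  // thread 0 publishes ei-fh into the slot that held d
if (lane == 3) si[sibase+4] = a;  // thread 3 publishes fg-di into the slot that held e
if (lane == 6) si[sibase+5] = a;  // thread 6 publishes dh-eg into the slot that held f
__syncthreads();
det = si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];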
Calculation of final result
This only involves (for each thread) dividing the previously computed cofactor for that thread by the just-computed determinant, and storing that result:
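if (idx < n*9)
  out[idx] = a / det;
Each thread's a holds its own element of M, so the division produces that thread's element of the inverse; the guard mirrors the bounds check used during loading.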