I am trying to parallelise [this file][1] on the GPU using a [PTX file with Matlab parallel.gpu.CUDAKernel][2]. My problem is with the [kron tensor product][3]. My code should compute `kron(a,b)` by multiplying each element of the first vector, `a` (32x1), by all the elements of the other vector, `b` (1x32), so the output is `k` (32x32) `= a.*b`. I wrote it in C++ and it worked. Since I am only concerned with summing all the elements of the 2D array, `m = sum(sum(kron(a,b)))`, I thought I could keep it simple and store the result as a 1D array. This is the code I am working on:

    for (i = 0; i < 32; i++)
        for (j = 0; j < 32; j++)
            k[i*32 + j] = a[i] * b[j];

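
For reference, here is a minimal C++ sketch of the flattened product and its total. The function names (`kron_flat`, `kron_sum`, `kron_sum_factored`) are made up for illustration and do not come from the linked file. The sketch also illustrates that, because every `a[i]` multiplies every `b[j]`, the grand total equals `sum(a) * sum(b)`, so if only `m` is needed the 32x32 array may not have to be formed at all.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Flattened product: k[i*n + j] = a[i] * b[j], with n = b.size().
std::vector<int> kron_flat(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> k(a.size() * b.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            k[i * b.size() + j] = a[i] * b[j];
    return k;
}

// m = sum(sum(kron(a,b))), computed by forming the product first.
long long kron_sum(const std::vector<int>& a, const std::vector<int>& b) {
    const std::vector<int> k = kron_flat(a, b);
    return std::accumulate(k.begin(), k.end(), 0LL);
}

// The same value without forming the product: sum(a) * sum(b).
long long kron_sum_factored(const std::vector<int>& a, const std::vector<int>& b) {
    const long long sa = std::accumulate(a.begin(), a.end(), 0LL);
    const long long sb = std::accumulate(b.begin(), b.end(), 0LL);
    return sa * sb;
}
```

With this shortcut the GPU work would reduce to two 32-element sums and one multiplication, assuming nothing else in the computation needs the individual products.
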
The loop is meant to multiply the `a[i]`th element by each element in `b`. On the GPU I thought I would go with 32 blocks, each with 32 threads, so the kernel should be:

    __global__ void myKrom(int* c, int* a, int* b) {
        int i = blockDim.x*blockIdx.x + threadIdx.x;
        while (i < 32) {
            c[i] = a[blockIdx.x] + b[blockDim.x*blockIdx.x + threadIdx.x];
        }
    }

That should do the trick, since `blockIdx.x` plays the role of the outer loop, but it didn't work. Could anybody tell me where I went wrong? May I also ask for a parallel way to do the sum?
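
For the summation part of the question, one common pattern is a pairwise (tree) reduction. Below is a hedged host-side C++ sketch of that pattern, not GPU code: the name `tree_sum` is hypothetical, and the input length is assumed to be a power of two (true for 32 or 32x32 elements). Each pass of the outer loop stands for one synchronised step, and the iterations of the inner loop are independent, so on the GPU they would be spread across threads with a `__syncthreads()` between passes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise (tree) sum, emulated sequentially on the host.
// Assumption: data.size() is a non-zero power of two.
int tree_sum(std::vector<int> data) {
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    // Halve the active range on every pass: n/2, n/4, ..., 1.
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // Each value of t is the work of one "thread" in this pass.
        for (std::size_t t = 0; t < stride; ++t)
            data[t] += data[t + stride];
    }
    return data[0];  // the total ends up in the first element
}
```

For 32 elements this takes 5 passes instead of 31 sequential additions.
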
When the first operand is the identity matrix, the result of the Kronecker product can be represented simply in cuSPARSE's `bsr` format. A simple example of this case is the Matlab instruction

    KRON(I, T)

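
To illustrate why this case fits a block format: `kron(I, T)` is block diagonal, with one copy of `T` per diagonal block. The host-side C++ sketch below builds the three arrays of a block sparse row layout for it. The struct `BsrMatrix` and the function `kron_identity_left` are hypothetical names, and the sketch assumes a square `b` x `b` matrix `T`, zero-based indices, and row-major storage inside each block. Copying the arrays to the device and the cuSPARSE calls themselves are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain host-side container for a block sparse row matrix (hypothetical type).
struct BsrMatrix {
    int blockDim;             // size b of each square block
    std::vector<int> rowPtr;  // one entry per block row, plus one
    std::vector<int> colInd;  // block column index of each stored block
    std::vector<double> val;  // stored blocks, b*b values each, row-major
};

// kron(I_n, T): n block rows, each holding a single block T on the diagonal.
BsrMatrix kron_identity_left(int n, const std::vector<double>& T, int b) {
    assert(T.size() == static_cast<std::size_t>(b) * b);
    BsrMatrix m;
    m.blockDim = b;
    for (int i = 0; i <= n; ++i) m.rowPtr.push_back(i);  // one block per row
    for (int i = 0; i < n; ++i) {
        m.colInd.push_back(i);                           // diagonal position
        m.val.insert(m.val.end(), T.begin(), T.end());   // copy of T
    }
    return m;
}
```
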
Likewise, in the simple case where the second operand is the identity matrix, the result of the Kronecker product can be represented in cuSPARSE's `bsr` format. A simple example of this case is the Matlab instruction

    KRON(S, I)

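
In this case each nonzero entry `S(i,j)` becomes one block equal to `S(i,j)` times the identity, so the block pattern of the result is the sparsity pattern of `S`. A hedged host-side C++ sketch follows. The names `BsrMatrix` and `kron_identity_right` are hypothetical, `S` is passed as a dense row-major array purely to keep the example short, and zero-based indices with row-major blocks are assumed. No cuSPARSE call is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain host-side container for a block sparse row matrix (hypothetical type).
struct BsrMatrix {
    int blockDim;             // size b of each square block
    std::vector<int> rowPtr;  // one entry per block row, plus one
    std::vector<int> colInd;  // block column index of each stored block
    std::vector<double> val;  // stored blocks, b*b values each, row-major
};

// kron(S, I_b): one block S(i,j) * I_b for every nonzero S(i,j).
BsrMatrix kron_identity_right(const std::vector<double>& S,
                              int rows, int cols, int b) {
    assert(S.size() == static_cast<std::size_t>(rows) * cols);
    BsrMatrix m;
    m.blockDim = b;
    m.rowPtr.push_back(0);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            const double s = S[i * cols + j];
            if (s == 0.0) continue;              // zero entry: no block stored
            m.colInd.push_back(j);
            for (int r = 0; r < b; ++r)          // scaled identity block
                for (int c = 0; c < b; ++c)
                    m.val.push_back(r == c ? s : 0.0);
        }
        m.rowPtr.push_back(static_cast<int>(m.colInd.size()));
    }
    return m;
}
```
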
You may actually mean one thread per output element, which is what you get when you call the kernel with

    myKrom<<<32, 32>>>(c, a, b);

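
The index mapping for such a launch can be checked on the host. The sketch below is plain C++, not a CUDA kernel: `krom_thread` and `launch_emulated` are hypothetical names, and the block and thread indices are passed in as ordinary arguments where a kernel would read `blockIdx.x`, `threadIdx.x` and `blockDim.x`. Compared with the kernel in the question, it multiplies instead of adding, indexes `b` with the thread index only (so it stays within `b`'s 32 elements), and has no `while` loop, since that loop never changes `i` and so would not terminate for threads that enter it.

```cpp
#include <cassert>

// Work of a single "thread": one product written to one output slot.
void krom_thread(int* c, const int* a, const int* b,
                 int block, int thread, int threads_per_block) {
    const int i = threads_per_block * block + thread;  // same as i*32 + j
    c[i] = a[block] * b[thread];
}

// Sequential stand-in for myKrom<<<blocks, threads_per_block>>>(c, a, b).
void launch_emulated(int* c, const int* a, const int* b,
                     int blocks, int threads_per_block) {
    assert(blocks >= 0 && threads_per_block >= 0);
    for (int block = 0; block < blocks; ++block)
        for (int thread = 0; thread < threads_per_block; ++thread)
            krom_thread(c, a, b, block, thread, threads_per_block);
}
```

In an actual kernel the two loops disappear, and a bounds check such as `if (i < 32*32)` is a common safeguard when the grid may be larger than the data.
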