I took the code from the answer to How can I add up two 2d (pitched) arrays using nested for loops?, tried to extend it from 2D to 3D, and changed a few other parts slightly, so it now looks like this:
__global__ void doSmth(int*** a) {
    for(int i=0; i<2; i++)
        for(int j=0; j<2; j++)
            for(int k=0; k<2; k++)
                a[i][j][k]=i+j+k;
}
int main() {
    int*** h_c = (int***) malloc(2*sizeof(int**));
    for(int i=0; i<2; i++) {
        h_c[i] = (int**) malloc(2*sizeof(int*));
        for(int j=0; j<2; j++)
            GPUerrchk(cudaMalloc((void**)&h_c[i][j],2*sizeof(int)));
    }
    int*** d_c;
    GPUerrchk(cudaMalloc((void**)&d_c,2*sizeof(int**)));
    GPUerrchk(cudaMemcpy(d_c,h_c,2*sizeof(int**),cudaMemcpyHostToDevice));
    doSmth<<<1,1>>>(d_c);
    GPUerrchk(cudaPeekAtLastError());
    int res[2][2][2];
    for(int i=0; i<2; i++)
        for(int j=0; j<2; j++)
            GPUerrchk(cudaMemcpy(&res[i][j][0],
                h_c[i][j],2*sizeof(int),cudaMemcpyDeviceToHost));
    for(int i=0; i<2; i++)
        for(int j=0; j<2; j++)
            for(int k=0; k<2; k++)
                printf("[%d][%d][%d]=%d\n",i,j,k,res[i][j][k]);
}
In the code above I use 2 as the size of each dimension of h_c; in the real implementation these sizes will be very large, and different for every subarray of the "int***" (or more dimensions). My problem is with the part after the kernel call, where I try to copy the results back into the res array. Can you help me fix the problem? Please show the solution in the style I am writing it above. Thanks!
First of all, I don't think talonmies, when he posted the answer to the previous question you mention, intended it to be representative of good coding, so figuring out how to extend it to 3D may not be the best use of your time. For example, why would we want to write a program that uses exactly one thread? While there might be legitimate uses for such a kernel, this is not one of them. Your kernel has the opportunity to do a bunch of independent work in parallel, but instead you are forcing it all onto one thread and serializing it. The parallel work here is just computing a[i][j][k] = i+j+k for every element, and each element is independent of the others.
Let's figure out how to handle that in parallel on the GPU.
Another introductory observation I would make is that since we are dealing with problems whose sizes are known ahead of time, let's use C to tackle them with as much benefit as we can get from the language. Nested loops of cudaMalloc calls may be needed in some cases, but I don't think this is one of them.
Here's a code that accomplishes the work in parallel:
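(The answer's code is not reproduced here. As a hedged sketch of what such a parallel, linear-memory kernel might look like, assuming a 2x2x2 problem, one thread per element, and illustrative names:)

```cuda
// Each thread computes exactly one element of the flattened 3D array.
__global__ void doSmthParallel(int *a, int nx, int ny, int nz) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    int k = threadIdx.z;
    if (i < nx && j < ny && k < nz)
        a[(i * ny + j) * nz + k] = i + j + k;
}

// Launch with one block of 2x2x2 threads, d_a from a single cudaMalloc:
//   doSmthParallel<<<1, dim3(2,2,2)>>>(d_a, 2, 2, 2);
```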
Since you've asked for it in the comments, here is the smallest number of changes I could make to your code to get it to work. Let's also remind ourselves of some of talonmies' comments from the previous question you reference:
"For code complexity and performance reasons, you really don't want to do that, using arrays of pointers in CUDA code is both harder and slower than the alternative using linear memory."
"it is such a poor idea compared to using linear memory."
I had to diagram this out on paper to make sure I got all my pointer copying correct.
In a nutshell, we have to do a successive sequence of:
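At each level of the pointer tree, that sequence amounts to: cudaMalloc a device-resident copy of the pointer array, cudaMemcpy the (device) pointer values into it, and record the resulting device pointer in the parent. A hedged sketch for the middle level of the question's tree (h_c1 is an illustrative name; h_c, d_c, and GPUerrchk are from the question):

```cuda
// For each row i, the kernel needs a *device-resident* array holding the
// two int* device pointers, because it dereferences a[i][j] on the GPU.
int **h_c1[2];  // host-side staging for the int** device pointers
for (int i = 0; i < 2; i++) {
    GPUerrchk(cudaMalloc((void**)&h_c1[i], 2 * sizeof(int*)));
    // h_c[i] holds the int* device pointers from the question's loop
    GPUerrchk(cudaMemcpy(h_c1[i], h_c[i], 2 * sizeof(int*),
                         cudaMemcpyHostToDevice));
}
// Top level: copy the array of int** device pointers (not h_c, whose
// entries are host pointers) into d_c.
GPUerrchk(cudaMemcpy(d_c, h_c1, 2 * sizeof(int**),
                     cudaMemcpyHostToDevice));
```

The original code copied h_c straight into d_c, so the kernel was dereferencing host pointers on the device; staging each level this way is what makes the deep copy valid.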