I am using 63 registers/thread ,so (32768 is maximum) i can use about 520 threads.I am using now 512 threads in this example.
(The parallelism is in the function "computeEvec" inside global computeEHfields function function.) The problems are:
1) The mem check error below.
2) When i use numPointsRp>2000 it show me "out of memory" ,but (if i am not doing wrong) i compute the global memory and it's ok.
-------------------------------UPDATED---------------------------
i run the program with cuda-memcheck and it gives me (only when numPointsRs>numPointsRp):
========= Invalid global read of size 4
========= at 0x00000428 in computeEHfields
========= by thread (2,0,0) in block (0,0,0)
========= Address 0x4001076e0 is out of bounds
========= ========= Invalid global read of size 4
========= at 0x00000428 in computeEHfields
========= by thread (1,0,0) in block (0,0,0)
========= Address 0x4001076e0 is out of bounds
========= ========= Invalid global read of size 4
========= at 0x00000428 in computeEHfields
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x4001076e0 is out of bounds
ERROR SUMMARY: 160 errors
-----------EDIT----------------------------
Also , some times (if i use only threads and not blocks (i haven't test it for blocks) ) if for example i have numPointsRs=1000 and numPointsRp=100 and then change the numPointsRp=200 and then again change the numPointsRp=100 i am not taking the first results!
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv
Rs=np.zeros((numPointsRs,3)).astype(np.float32)
for k in range (numPointsRs):
Rs[k]=[0,k,0]
Rp=np.zeros((numPointsRp,3)).astype(np.float32)
for k in range (numPointsRp):
Rp[k]=[1+k,0,0]
#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)
J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))
mod =SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>
#define RowRsSize %(numrs)d
#define RowRpSize %(numrp)d
typedef pycuda::complex<float> cmplx;
extern "C"{
__device__ void computeEvec(float Rs_mat[][3], int numPointsRs,
cmplx J[][3],
cmplx M[][3],
float *Rp,
cmplx kp,
cmplx eta,
cmplx *Evec,
cmplx *Hvec, cmplx *All)
{
while (c<numPointsRs){
...
c++;
}
}
__global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
float *Rp_mat_, int numPointsRp,
cmplx *J_,
cmplx *M_,
cmplx kp,
cmplx eta,
cmplx E[][3],
cmplx H[][3], cmplx *All )
{
float Rs_mat[RowRsSize][3];
float Rp_mat[RowRpSize][3];
cmplx J[RowRsSize][3];
cmplx M[RowRsSize][3];
int k=threadIdx.x+blockIdx.x*blockDim.x;
while (k<numPointsRp)
{
computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
k+=blockDim.x*gridDim.x;
}
}
}
"""% { "numrs":numPointsRs, "numrp":numPointsRp},no_extern_c=1)
func = mod.get_function("computeEHfields")
func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),Evec_gpu,Hvec_gpu, All_gpu, block=(128,1,1),grid=(200,1))
print(" \n")
#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()
--------------------GPU MODEL------------------------------------------------
Device 0: "GeForce GTX 560"
CUDA Driver Version / Runtime Version 4.20 / 4.10
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 0) Multiprocessors x (48) CUDA Cores/MP: 0 CUDA Cores //CUDA Cores 336 => 7 MP and 48 Cores/MP
Now we have some real code to work with, let's compile it and see what happens. Using
RowRsSize=2000
andRowRpSize=200
and compiling with the CUDA 4.2 toolchain, I get:The key numbers are 57 registers and 122432 bytes stack frame per thread. The occupancy calculator suggests that a block of 512 threads will have a maximum of 1 block per SM, and your GPU has 7 SM. This gives a total of 122432 * 512 * 7 = 438796288 bytes of stack frame (local memory) to run your kernel, before you have allocated a single of byte of memory for input and output using pyCUDA. On a GPU with 1Gb of memory, it isn't hard to imagine running out of memory. Your kernel has a enormous local memory footprint. Start thinking about ways to reduce it.
As I indicated in comments, it is absolutely unclear why every thread needs a complete copy of the input data in this kernel code. It results in a gigantic local memory footprint and there seems to be absolutely no reason why the code should be written in this way. You could, I suspect, modify the kernel to something like this:
and the main __device__ function it calls to something like this:
and eliminate every byte of thread local memory without changing the functionality of the computational code at all.
I'm not familiar with pycuda and didn't read into your code too deeply. However you have more blocks and more threads, so it will
local memory (probably the kernel's stack, it's allocated per thread),
shared memory (allocated per block), or
global memory that gets allocated based on
grid
orgridDim
.You can reduce the stack size calling
(the code is for the C runtime API, but the pycuda equivalent shouldn't be too hard to find).