I have a basic C++ FFTW implementation that looks like this:
for (int i = 0; i < N; i++){
    // declare pointers and plan
    fftw_complex *in, *out;
    fftw_plan p;
    // allocate
    in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    // initialize "in"
    ...
    // create plan
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    // execute plan
    fftw_execute(p);
    // clean up
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}
I'm doing N FFTs in a for loop. I know FFTW can reuse a single plan for many transforms, but in my implementation in and out are different every loop. The point is that I'm doing the entire FFTW pipeline (allocate, plan, execute, clean up) INSIDE a for loop.
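(To clarify what I mean by that: FFTW does let a single plan be executed on different arrays via its new-array execute function fftw_execute_dft, roughly as in the untested sketch below. The size N = 1024 is arbitrary and the buffers just stand in for my per-iteration data.)

#include <fftw3.h>

int main(){
    const int N = 1024; // arbitrary size for this sketch
    // plan once, using buffers from fftw_malloc so the alignment matches what the plan expects
    fftw_complex *in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex *out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    fftw_plan p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < N; i++){
        // fill "in" with this iteration's data (placeholder values here)
        for (int k = 0; k < N; k++){ in[k][0] = (double)k; in[k][1] = 0.0; }
        // new-array execute: reuse the same plan on any buffers of matching size/alignment
        fftw_execute_dft(p, in, out);
        // ... use "out" ...
    }

    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
    return 0;
}

Even with the plan hoisted out of the loop, though, the N executions still run one after another on the CPU, which is why I'm looking at CUDA.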
I want to transition to CUDA to speed this up. I understand that CUDA has its own FFT library, cuFFT. The syntax is very similar; from the online documentation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
However, each of these "kernels" (as NVIDIA calls them) (cufftPlan3d, cufftExecC2C, etc.) is a call to and from the GPU. If I understand the CUDA structure correctly, each of these method calls is an INDIVIDUALLY parallelized operation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
I understand how this can speed up my code by running each FFT step on a GPU. But what if I want to parallelize my entire for loop? What if I want each of my original N loop iterations to run the entire FFTW pipeline on the GPU? Can I create a custom "kernel" and call FFTW methods from the device (GPU)?
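For what it's worth, the closest thing I can see in cuFFT is batching: describing all the transforms with one plan via cufftPlanMany and launching them with a single exec call. The sketch below is untested, and the sizes (N, BATCH) are arbitrary placeholders rather than values from my code:

#include <cufft.h>
#include <cuda_runtime.h>

int main(){
    const int N = 1024;      // length of each 1D transform (placeholder)
    const int BATCH = 1000;  // number of independent transforms, i.e. my loop count (placeholder)
    int n[1] = { N };

    // one device buffer holding BATCH contiguous signals of length N
    cufftComplex *d_data;
    cudaMalloc((void**)&d_data, sizeof(cufftComplex) * N * BATCH);
    cudaMemset(d_data, 0, sizeof(cufftComplex) * N * BATCH); // or cudaMemcpy the real signals in here

    // one plan describing all BATCH transforms (NULL embed pointers => contiguous layout)
    cufftHandle plan;
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, N,    // input layout
                  NULL, 1, N,    // output layout
                  CUFFT_C2C, BATCH);

    // a single call runs every transform in the batch on the GPU
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}

But that still drives everything from the host, so the question above stands: can the whole per-iteration pipeline run from inside a kernel on the device?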