I have a basic C++ FFTW implementation that looks like this:
for (int i = 0; i < N; i++){
    // declare pointers and plan
    fftw_complex *in, *out;
    fftw_plan p;
    // allocate
    in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    // initialize "in"
    ...
    // create plan
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    // execute plan
    fftw_execute(p);
    // clean up
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}
I'm doing N FFTs in a for loop. I know FFTW can reuse a single plan for many transforms, but in my implementation in and out are different every loop. The point is that I'm doing the entire FFTW pipeline (allocate, plan, execute, clean up) INSIDE a for loop.
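(To clarify what I mean by that: FFTW does let a single plan be executed on different arrays via its new-array execute function fftw_execute_dft, roughly as in the untested sketch below. The size N = 1024 is arbitrary and the buffers just stand in for my per-iteration data.)

#include <fftw3.h>

int main(){
    const int N = 1024; // arbitrary size for this sketch
    // plan once, using buffers from fftw_malloc so the alignment matches what the plan expects
    fftw_complex *in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex *out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    fftw_plan p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < N; i++){
        // fill "in" with this iteration's data (placeholder values here)
        for (int k = 0; k < N; k++){ in[k][0] = (double)k; in[k][1] = 0.0; }
        // new-array execute: reuse the same plan on any buffers of matching size/alignment
        fftw_execute_dft(p, in, out);
        // ... use "out" ...
    }

    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
    return 0;
}

Even with the plan hoisted out of the loop, though, the N executions still run one after another on the CPU, which is why I'm looking at CUDA.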
I want to transition to CUDA to speed this up. I understand that CUDA has its own FFT library, cuFFT. The syntax is very similar; from the online documentation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
However, each of these "kernels" (as NVIDIA calls them) (cufftPlan3d, cufftExecC2C, etc.) is a call to and from the GPU. If I understand the CUDA structure correctly, each of these method calls is an INDIVIDUALLY parallelized operation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
I understand how this can speed up my code by running each FFT step on a GPU. But what if I want to parallelize my entire for loop? What if I want each of my original N loop iterations to run the entire FFTW pipeline on the GPU? Can I create a custom "kernel" and call FFTW methods from the device (GPU)?
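For what it's worth, the closest thing I can see in cuFFT is batching: describing all the transforms with one plan via cufftPlanMany and launching them with a single exec call. The sketch below is untested, and the sizes (N, BATCH) are arbitrary placeholders rather than values from my code:

#include <cufft.h>
#include <cuda_runtime.h>

int main(){
    const int N = 1024;      // length of each 1D transform (placeholder)
    const int BATCH = 1000;  // number of independent transforms, i.e. my loop count (placeholder)
    int n[1] = { N };

    // one device buffer holding BATCH contiguous signals of length N
    cufftComplex *d_data;
    cudaMalloc((void**)&d_data, sizeof(cufftComplex) * N * BATCH);
    cudaMemset(d_data, 0, sizeof(cufftComplex) * N * BATCH); // or cudaMemcpy the real signals in here

    // one plan describing all BATCH transforms (NULL embed pointers => contiguous layout)
    cufftHandle plan;
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, N,    // input layout
                  NULL, 1, N,    // output layout
                  CUFFT_C2C, BATCH);

    // a single call runs every transform in the batch on the GPU
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}

But that still drives everything from the host, so the question above stands: can the whole per-iteration pipeline run from inside a kernel on the device?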