I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP.
In the experiments and discussion below, I find that cuFFT is slower than FFTW for batched 2D FFTs. Why is cuFFT so slow, and is there anything I can do to make cuFFT run faster?
Experiments (code download)
Our computer vision application requires a forward FFT on a bunch of small planes of size 256x256. I'm running the FFTs on on HOG features with a depth of 32, so I use the batch mode to do 32 FFTs per function call. Typically, I do about 8 FFT function calls of size 256x256 with a batch size of 32.
FFTW + OpenMP
The following code executes in 16.0ms on an Intel i7-2600 8-core CPU
.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
//if nCols is even, cols_padded = (nCols+2). if nCols is odd, cols_padded = (nCols+1)
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent ot onembed=NULL
float* h_in = (float*)malloc(sizeof(float)*nRows*cols_padded*depth);
memset(h_in, 0, sizeof(float)*nRows*cols_padded*depth);
fftwf_complex* h_freq = reinterpret_cast<fftwf_complex*>(h_in); //in-place version
fftwf_plan forwardPlan = fftwf_plan_many_dft_r2c(2, //rank
n, //dims -- this doesn't include zero-padding
depth, //howmany
h_in, //in
inembed, //inembed
depth, //istride
1, //idist
h_freq, //out
onembed, //onembed
depth, //ostride
1, //odist
FFTW_PATIENT /*flags*/);
double start = read_timer();
#pragma omp parallel for
for(int i=0; i<nIter; i++){
fftwf_execute_dft_r2c(forwardPlan, h_in, h_freq);
}
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
cuFFT
The following code executes in 21.7ms on a top-of-the-line NVIDIA K20 GPU
. Note that, even if I use streams, cuFFT does not run multiple FFTs concurrently.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent ot onembed=NULL in FFTW
cufftHandle forwardPlan;
float* d_in; cufftComplex* d_freq;
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
2, //rank
n, //dimensions = {nRows, nCols}
inembed, //inembed
depth, //istride
1, //idist
onembed, //onembed
depth, //ostride
1, //odist
CUFFT_R2C, //cufftType
depth /*batch*/));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
d_freq = reinterpret_cast<cufftComplex*>(d_in);
double start = read_timer();
for(int i=0; i<nIter; i++){
CHECK_CUFFT(cufftExecR2C(forwardPlan, d_in, d_freq));
}
CHECK_CUDART(cudaDeviceSynchronize());
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
Other notes
- In the GPU version,
cudaMemcpy
s between the CPU and GPU are not included in my computation time. - The performance numbers presented here are averages of several experiments, where each experiment has 8 FFT function calls (total of 10 experiments, so 80 FFT function calls).
- I've tried many problem sizes (e.g. 128x128, 256x256, 512x512, 1024x1024), all with depth=32. Based on the
nvvp
profiler, some sizes like 1024x1024 are able to fully saturate the GPU. But, for all of these sizes, the CPU FFTW+OpenMP is faster than cuFFT.