I have a CUDA program that calculates FFTs of, let's say, size 50000. Currently, I copy the whole array to the GPU and execute the cuFFT. Now I am trying to optimize the program, and the NVIDIA Visual Profiler tells me to hide the memcpy by overlapping it with concurrent computations. My question is:
Is it possible, for example, to copy the first 5000 elements, then start calculating, and then copy the next batch of data in parallel to the calculations, and so on?
Since a DFT is basically a sum over the time values multiplied with a complex exponential function, I think it should be possible to calculate the FFT "blockwise".
Does cuFFT support this? Is it, in general, a good computational idea?
EDIT
To be more clear: I do not want to calculate different FFTs in parallel on different arrays. Let's say I have a long trace of a sinusoidal signal in the time domain and I want to know which frequencies are in the signal. My idea is to copy, for example, one third of the signal length to the GPU, then the next third, while in parallel calculating the FFT on the first third of the already copied input values. Then copy the last third and update the output values until all the time values are processed. In the end there should be one output array with a peak at the frequency of the sine.
Please take into account the comments above and, in particular, that:

- if you calculate the FFT over Npartial elements, you will have an output of Npartial elements;
- the outputs of such chunk FFTs therefore lie on a coarser frequency grid than the full-size FFT, so they cannot be combined with the full-size result directly.

Taking the above two points into account, I think you can only "emulate" what you would like to achieve if you properly use zero padding in the way illustrated by the code below. As you will see, letting N be the data size and dividing the data into NUM_STREAMS chunks, the code performs NUM_STREAMS zero-padded, streamed cuFFT calls of size N. After the cuFFT calls, you have to combine (sum) the partial results.

This is the timeline of the above code when run on a Kepler K20c card. As you can see, the computation overlaps the async memory transfers.