We are considering porting an application from a dedicated digital signal processing chip to run on generic x86 hardware. The application does a lot of Fourier transforms, and from brief research, it appears that FFTs are fairly well suited to computation on a GPU rather than a CPU. For example, this page has some benchmarks with a Core 2 Quad and a GF 8800 GTX that show a 10-fold decrease in calculation time when using the GPU:
http://www.cv.nrao.edu/~pdemores/gpu/
However, in our product, size constraints restrict us to small form factors such as PC104 or Mini-ITX, and thus to rather limited embedded GPUs.
Is offloading computation to the GPU something that is only worth doing with meaty graphics cards on a proper PCIe bus, or would even embedded GPUs offer performance improvements?
The 8800 has on the order of 100 cores running at around half a GHz. I don't think any of the current embedded GPUs for small form factors have anywhere near as many shader/compute cores.
I would like to address your question about embedded GPUs specifically.
They generally have far fewer shader cores, fewer registers per core, and lower memory bandwidth compared to the high-end GPUs found in desktops. However, running FFT-like applications on an embedded GPU can still outperform an onboard multicore CPU [1]. The major advantage of embedded GPUs is that they share a common memory with the CPU, which avoids the memory copy from host to device.
Almost all embedded GPUs (e.g. Mali from ARM, Adreno from Qualcomm) support OpenCL, so using an OpenCL FFT library on an embedded GPU can give good performance (clFFT from AMD is well known and open source). Tuning the OpenCL code to the embedded GPU architecture can improve it further (see the ARM Mali-T600 Series GPU OpenCL Developer Guide at http://infocenter.arm.com).
[1] Arian Maghazeh, Unmesh Bordoloi, Petru Eles, Zebo Peng. General Purpose Computing on Low-Power Embedded GPUs: Has It Come of Age?
Having developed FFT routines both on x86 hardware and GPUs (prior to CUDA, on 7800 GTX hardware), I found from my own results that for smaller FFT sizes (below 2^13) the CPU was faster, while above these sizes the GPU was faster. For instance, a 2^16-point FFT computed 2-4x more quickly on the GPU than the equivalent transform on the CPU. See the table of times below (all times are in seconds, comparing a 3 GHz Pentium 4 vs. a 7800 GTX. This work was done back in 2005, so the hardware is old and, as I said, non-CUDA; newer libraries may show larger improvements).
As other posters have suggested, the transfer of data to/from the GPU is the hit you take. Smaller FFTs can be performed on the CPU, with some implementations/sizes fitting entirely in cache. This makes the CPU the best choice for small FFTs (below ~1024 points). If, on the other hand, you need to perform large batches of work on data with minimal moves to/from the GPU, then the GPU will beat the CPU hands down.
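To find where that crossover sits on your own hardware, it is easy to time a CPU FFT over a range of sizes. The sketch below uses NumPy's FFT purely as a stand-in for whatever CPU library you end up with; the sizes and repeat count are arbitrary examples:

```python
import time
import numpy as np

def time_fft(n, repeats=10):
    """Average wall-clock time of an n-point complex FFT on the CPU."""
    x = (np.random.rand(n) + 1j * np.random.rand(n)).astype(np.complex64)
    np.fft.fft(x)  # warm-up run (twiddle-factor caches, allocator, etc.)
    start = time.perf_counter()
    for _ in range(repeats):
        np.fft.fft(x)
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    for exp in (10, 13, 16):
        n = 2 ** exp
        print(f"2^{exp:>2} points: {time_fft(n) * 1e6:8.1f} us")
```

Running the same sizes through your candidate GPU path (including the transfers) gives you the crossover point directly rather than relying on 2005-era numbers.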
I would suggest using FFTW if you want a fast FFT implementation, or the Intel Math Kernel Library if you want an even faster (commercial) implementation. For FFTW, creating plans with the FFTW_MEASURE flag will time candidate FFT routines and pick the fastest one for your specific hardware. I go into detail about this in this question.
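FFTW's key idea is that you pay a one-off planning cost and then execute the plan many times. As an illustration of that plan/execute split (a toy radix-2 FFT in Python, emphatically not FFTW itself), precomputing the bit-reversal permutation and twiddle factors plays the role of the plan:

```python
import numpy as np

class FFTPlan:
    """Toy "plan once, execute many" radix-2 FFT, mimicking FFTW's
    plan/execute split. This is NOT FFTW, just an illustration of why
    front-loading setup work pays off when you transform many buffers."""

    def __init__(self, n):
        assert n > 1 and n & (n - 1) == 0, "n must be a power of two"
        self.n = n
        bits = n.bit_length() - 1
        # "Planning": precompute the bit-reversal permutation...
        self.perm = np.array(
            [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)])
        # ...and the twiddle factors for every butterfly stage.
        self.twiddles = [np.exp(-2j * np.pi * np.arange(m // 2) / m)
                         for m in (2 ** s for s in range(1, bits + 1))]

    def execute(self, x):
        # "Execution": only butterflies remain, no setup work.
        a = np.asarray(x, dtype=complex)[self.perm]
        for w in self.twiddles:
            m = 2 * len(w)
            a = a.reshape(-1, m)
            even, odd = a[:, :m // 2], a[:, m // 2:] * w
            a = np.concatenate([even + odd, even - odd], axis=1).ravel()
        return a

plan = FFTPlan(8)                 # one-off planning cost
y = plan.execute(np.arange(8))    # cheap to repeat for many inputs
```

In real FFTW the planner additionally benchmarks many algorithm variants (that is what FFTW_MEASURE does), but the amortization argument is the same.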
For GPU implementations you can't get better than the one provided by NVIDIA with CUDA. The performance of GPUs has increased significantly since I did my experiments on a 7800 GTX, so I would suggest giving their SDK a go for your specific requirement.
You need to compare the cost of moving data to and from GPU memory versus any speed benefit from using the GPU. Although it's possible to overlap the I/O and the computation somewhat, you may still suffer if the I/O bandwidth requirements are greater than the computational bandwidth. If you have any additional computation that can be performed on the FFT data while it's resident in GPU memory then this can help to mitigate the I/O cost.
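As a back-of-envelope sketch of that trade-off, you can model the offload decision with a few assumed numbers. Every default figure below (PCIe bandwidth, GPU and CPU throughput, fixed launch overhead) is a placeholder, not a measurement; substitute benchmarks from your own hardware:

```python
import math

def fft_flops(n):
    """Rough operation count for an n-point complex FFT (~5 n log2 n)."""
    return 5 * n * math.log2(n)

def worth_offloading(n, pcie_gb_s=4.0, gpu_gflop_s=50.0,
                     cpu_gflop_s=5.0, overhead_s=1e-4):
    """Crude model: offloading pays when GPU compute + transfer + launch
    overhead beats CPU compute. All default numbers are ASSUMED
    placeholders -- benchmark your own hardware instead."""
    bytes_moved = 2 * n * 8                    # complex64 in and out
    t_transfer = bytes_moved / (pcie_gb_s * 1e9)
    t_gpu = fft_flops(n) / (gpu_gflop_s * 1e9)
    t_cpu = fft_flops(n) / (cpu_gflop_s * 1e9)
    return t_gpu + t_transfer + overhead_s < t_cpu

for exp in (10, 13, 16, 20):
    print(f"2^{exp:>2}: offload worthwhile -> {worth_offloading(2 ** exp)}")
```

With these placeholder numbers the model reproduces the qualitative picture above: small transforms stay on the CPU because the fixed overhead and transfer dominate, while large ones favor the GPU because compute grows as n log n but transfer only as n.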
It's also important to note that GPU-based FFTs typically only give good performance for single-precision data. Furthermore, you need to compare against the best possible CPU-based FFT, e.g. FFTW built for single precision and using SSE.
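To see the single-precision floor, compare a transform of float32-quantized data against a double-precision reference. Note that NumPy's FFT itself computes in double precision, so this sketch only shows the error introduced by quantizing the input; a GPU FFT executed end to end in float32 will do at least this much worse than a double-precision CPU result:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)

ref = np.fft.fft(x)                          # complex128 reference
# Quantize the input to single precision, as a float32 GPU pipeline would.
approx = np.fft.fft(x.astype(np.complex64))
rel_err = np.max(np.abs(approx - ref)) / np.max(np.abs(ref))
print(f"relative error from float32 quantization: {rel_err:.1e}")
```

If your application needs better than roughly single-precision accuracy in the transform output, that alone may rule out most GPU FFT paths regardless of speed.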
One problem might be getting the technical information you need to load and execute code on the GPU and to communicate and exchange data with the CPU. Nvidia provides an API called CUDA specifically for this purpose. So choose a board with an Nvidia GPU that supports CUDA, and you can probably experiment and benchmark at very little cost, and even prototype on a regular desktop PC.
With respect to small form-factor hardware, this discussion may be relevant.