I'm working on a project that needs to make use of FFTs on both Nvidia and AMD graphics cards. I initially looked for a library that would work on both (thinking this would be the OpenCL way) but I wasn't having any luck.
Someone suggested to me that I would have to use each vendor's FFT implementation and write a wrapper that chose what to do based on the platform. I found AMD's implementation pretty easily, but I'm actually working with an Nvidia card in the meantime (and this is the more important one for my particular application).
The only Nvidia implementation I can find is CUFFT. Does anyone know how I can actually use the CUFFT library from OpenCL? The only way I can think of is to have some CUDA code alongside my OpenCL code. I've read that I can't just use OpenCL buffers as CUDA pointers (see "Trying to mix in OpenCL with CUDA in NVIDIA's SDK template"). Instead, would I have to copy the buffers back to the host after running my OpenCL kernels and then copy them to the GPU again using the CUDA memory transfer routines? I don't really like this approach, as it seems to involve pointless memory transfers. I would much prefer to use CUFFT directly from OpenCL.
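The wrapper idea mentioned above could be sketched roughly as follows. This is only an illustration of the dispatch step: the `fft_backend` enum and `choose_fft_backend` function are hypothetical names, not part of any library, and the vendor strings are the ones these platforms typically report, which is an assumption worth checking on the target machine. In real host code the string would come from `clGetPlatformInfo` with `CL_PLATFORM_VENDOR`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical backend identifiers for the wrapper; not from any library. */
typedef enum {
    FFT_BACKEND_AMD,
    FFT_BACKEND_NVIDIA,
    FFT_BACKEND_UNKNOWN
} fft_backend;

/* Map an OpenCL platform vendor string to an FFT backend.
 * Assumption: the string was obtained from
 * clGetPlatformInfo(platform, CL_PLATFORM_VENDOR, ...), and the vendors
 * identify themselves with the substrings matched below. */
fft_backend choose_fft_backend(const char *vendor)
{
    if (vendor == NULL)
        return FFT_BACKEND_UNKNOWN;
    if (strstr(vendor, "NVIDIA") != NULL)
        return FFT_BACKEND_NVIDIA;
    if (strstr(vendor, "Advanced Micro Devices") != NULL ||
        strstr(vendor, "AMD") != NULL)
        return FFT_BACKEND_AMD;
    return FFT_BACKEND_UNKNOWN;
}
```

With this sketch, `choose_fft_backend("NVIDIA Corporation")` returns `FFT_BACKEND_NVIDIA`, and an unrecognised vendor falls through to `FFT_BACKEND_UNKNOWN` so the caller can report an error instead of guessing.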
NVIDIA has not done any work to support OpenCL libraries, like FFT. It also has not provided source to its CUDA libraries, so there is no way to run those using OpenCL.
AMD's FFT library is your best bet and will run on any other OpenCL-compliant device, including NVIDIA's GPUs. ArrayFire OpenCL leverages AMD's FFT library, and I've run that on Intel, NVIDIA, and AMD devices in our lab.
In addition to Ben's AMD suggestion, you could also investigate the Apple FFT example code. However, their code runs only on GPU devices, because it checks the device type of the command queue you pass in.
The SHOC benchmark suite on GitHub also includes FFT code, which I have tested on an NVIDIA 650M GPU, an Intel GPU, and an Intel CPU. On Windows it takes a few minutes to create a project and set your include and link paths, but it was straightforward. Running on the Intel GPU requires setting the command-line options or making a small code modification, since the Intel GPU is device 1, not device 0, which is the default in the SHOC benchmark suite.
I did not verify the correctness of the output, only that it compiled and ran to completion.
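As a minimal sketch of the device-index issue, a host program can select a device by type rather than relying on a fixed index. The `dev_type` enum and `find_device` function here are hypothetical stand-ins so the example compiles without the OpenCL headers; in real code the types would be read with `clGetDeviceInfo` and `CL_DEVICE_TYPE`. The device ordering in the usage note is an assumption for illustration only.

```c
#include <stddef.h>

/* Hypothetical stand-ins for OpenCL's CL_DEVICE_TYPE_CPU and
 * CL_DEVICE_TYPE_GPU, so the sketch needs no OpenCL headers. */
typedef enum { DEV_CPU, DEV_GPU } dev_type;

/* Return the index of the first device of the wanted type,
 * or -1 if the list contains no such device. */
int find_device(const dev_type *types, size_t count, dev_type wanted)
{
    for (size_t i = 0; i < count; ++i) {
        if (types[i] == wanted)
            return (int)i;
    }
    return -1;
}
```

If a platform happened to list its CPU first and its GPU second, `find_device` with `DEV_GPU` would return 1, so no command-line option or source change would be needed to reach the GPU.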