Efficient 2D FFT on real input data?

I'm currently implementing a two dimensional FFT for real input data using opencl (more specifically a fast 2D convolution using FFTs, so I only need something which behaves similary enough to apply the convolution to). The 2D FFT is implemented using an 1D FFT on the rows and afterwards an 1D FFT on the cols.

To make this more efficient I'm trying to use the symmetries of FFTs with real input in order to be able to calculate smaller FFTs. I found that I can combine two rows into one, using the first as real component, the second as imaginary component, do the first 1D FFT on the resulting row and then use the symmetry properties to construct the results of the 1D FFTs of the individual rows from that. So what I'm doing is basically the following:

Let f and g be rows from the matrix.

Construct x = f + i * g
Transform to get F(x) = F(f) + i * F(g)
Use symmetries to extract F(f) and F(g) from F(x)

I can't however just input the results directly into the 2nd 1D FFT, because in that case I would not transform the whole matrix, but two submatrices instead. However extracting the data between the transformations means either storing more data (n/2+1 entries needed to express the result of an 1D FFT on real input) or combine the elements at index 0 and index n/2 into one element (combining using the same trick, since both numbers are guaranteed to be real) and use the same amount of storage but have to make a spcial case for that in my convolution.

Since I try to reuse buffers as much as possible (due to limited RAM availible on the gpu) using more storage isn't a nice solution. Furthermore my algorithms are not equipped to work on matrixsizes which are not power of 2 / multiples of 16 (varies from kernel to kernel). I would rather avoid introducing special cases either, since those would make my kernels more complex hurting efficiency (I'm already having trouble to minimize the register count used by each kernel).

So my question is if there is an elegant approach to this problem, meaning one which will work without either using more memory or special cases for certain elements?

Ideally I would like to be able to do the whole FFT without splitting my combined data in the middle of the FFT, but I'm not sure thats possible.

Hmmm...my two references are:

http://www.engineeringproductivitytools.com/stuff/T0001/PT10.HTM http://images.apple.com/acg/pdf/FFTapps_20090909.pdf

I think that committing to a "hermitian" data structure, with the 0 and n/2 values packed into the first element is the way to go, as forward/inverse and hermitian structures will work out better.

That way, you have rUnWrap(FFT(n/2, Even(x) + i*Odd(x)))= rFFT(x), and the riFFT can work on the "hermitian" array, producing the pair of arrays Even and Odd, which again gives the original structure.

There are also other samplings that can be done, whereby the the original array is broken into 4 n/2xn/2 arrays, rooted at (0,0),(0,1),(1,0),(1,1) and then wrapped up at the end, using a final radix-4 pass...perhaps that is better for the GPU memory...I don't know.

alan