CUFFT | cannot figure out a simple example

I've been struggling the whole day, trying to make a basic CUFFT example work properly. However i run into a little problem which I cannot identify. Basically I have a linear 2D array vx with x and y coordinates. Then I just calculate a forward then backward CUFFT (in-place), that simple. Then I copy back the array vx, normalize it by NX*NY , then display.

#define NX 32
#define NY 32
#define LX (2*M_PI)
#define LY (2*M_PI)
float *x = new float[NX*NY];
float *y = new float[NX*NY];
float *vx = new float[NX*NY];
for(int j = 0; j < NY; j++){
    for(int i = 0; i < NX; i++){
        x[j*NX + i] = i * LX/NX;
        y[j*NX + i] = j * LY/NY;
        vx[j*NX + i] = cos(x[j*NX + i]);
    }
}
float *d_vx;
CUDA_CHECK(cudaMalloc(&d_vx, NX*NY*sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_vx, vx, NX*NY*sizeof(float), cudaMemcpyHostToDevice));
cufftHandle planr2c;
cufftHandle planc2r;
CUFFT_CHECK(cufftPlan2d(&planr2c, NY, NX, CUFFT_R2C));
CUFFT_CHECK(cufftPlan2d(&planc2r, NY, NX, CUFFT_C2R));
CUFFT_CHECK(cufftSetCompatibilityMode(planr2c, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftSetCompatibilityMode(planc2r, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftExecR2C(planr2c, (cufftReal *)d_vx, (cufftComplex *)d_vx));
CUFFT_CHECK(cufftExecC2R(planc2r, (cufftComplex *)d_vx, (cufftReal *)d_vx));
CUDA_CHECK(cudaMemcpy(vx, d_vx, NX*NY*sizeof(cufftReal), cudaMemcpyDeviceToHost));
for (int j = 0; j < NY; j++){
    for (int i = 0; i < NX; i++){
        printf("%.3f ", vx[j*NX + i]/(NX*NY));
    }
    printf("\n");
}

When vx is defined as cos(x) or sin(x), it works fine, but when using sin(y) or cos(y), it gives me back the correct function (sin or cos) but with half amplitude (that is, oscillating between 0.5 and -0.5 instead of 1 and -1) ! Note that using sin(2*y) or cos(2*y) (or sin(4*y), cos(4*y), ...) works fine. Any idea?

The problem here is that input and output of an in-place real to complex transform is a complex type whose size isn't the same as the input real data (it is twice as large). You haven't allocated enough memory to hold the intermediate complex results of the real to complex transform. Quoting from the documentation:

cufftExecR2C() (cufftExecD2Z()) executes a single-precision (double-precision) real-to-complex, implicitly forward, CUFFT transform plan. CUFFT uses as input data the GPU memory pointed to by the idata parameter. This function stores the nonredundant Fourier coefficients in the odata array. Pointers to idata and odata are both required to be aligned to cufftComplex data type in single-precision transforms and cufftDoubleComplex data type in double-precision transforms.

The solution is either to allocate a second device buffer to hold the intermediate result or enlarge the in place allocation so it is large enough to hold the complex data. So the core transform code changes to something like:

float *d_vx;
CUDA_CHECK(cudaMalloc(&d_vx, NX*NY*sizeof(cufftComplex)));
CUDA_CHECK(cudaMemcpy(d_vx, vx, NX*NY*sizeof(cufftComplex), cudaMemcpyHostToDevice));
cufftHandle planr2c;
cufftHandle planc2r;
CUFFT_CHECK(cufftPlan2d(&planr2c, NY, NX, CUFFT_R2C));
CUFFT_CHECK(cufftPlan2d(&planc2r, NY, NX, CUFFT_C2R));
CUFFT_CHECK(cufftSetCompatibilityMode(planr2c, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftSetCompatibilityMode(planc2r, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftExecR2C(planr2c, (cufftReal *)d_vx, d_vx));
CUFFT_CHECK(cufftExecC2R(planc2r, d_vx, (cufftReal *)d_vx));
CUDA_CHECK(cudaMemcpy(vx, d_vx, NX*NY*sizeof(cufftComplex), cudaMemcpyDeviceToHost));

[disclaimer: written in browser, never compiled or tested, use at own risk]

Note you will need to adjust the host code to match the size and type of the input and data.

As a final comment, would it have been that hard to add the additional 8 or 10 lines required to turn what you posted into a compilable, runnable example that someone trying to help you could work with?