Comparatively slow python numpy 3D Fourier Transfo

For my work I need to perform discrete fourier transformations (DFTs) on large images. In the current example I require a 3D FT for a 1921 x 512 x 512 image (along with 2D FFTs of 512 x 512 images). Right now, I am using the numpy package and the associated function np.fft.fftn(). The code snippet below exemplarily shows 2D and 3D FFT times on an equal-sized/slightly smaller 2D/3D random-number-generated grid in the following way:

import sys
import numpy as np
import time

tas = time.time()
a = np.random.rand(512, 512)
tab = time.time()
b = np.random.rand(100, 512, 512)

tbfa = time.time()

fa = np.fft.fft2(a)
tfafb = time.time()
fb = np.fft.fftn(b)
tfbe = time.time()

print "initializing 512 x 512 grid:", tab - tas
print "initializing 100 x 512 x 512 grid:", tbfa - tab
print "2D FFT on 512 x 512 grid:", tfafb - tbfa
print "3D FFT on 100 x 512 x 512 grid:", tfbe - tfafb

Output:

initializing 512 x 512 grid: 0.00305700302124
initializing 100 x 512 x 512 grid: 0.301637887955
2D FFT on 512 x 512 grid: 0.0122730731964
3D FFT on 100 x 512 x 512 grid: 3.88418793678

The problem that I have is that I will need this process quite often, so the time spent per image should be short. When testing on my own computer (middle-segment laptop, 2GB RAM allocated to virtual machine (--> therefore smaller test grid)), as you can see the 3D FFT takes ~ 5 s (order-of-magnitude). Now, at work, the machines are way better, cluster/grid-architecture systems and FFTs are much faster. In both cases the 2D ones finish quasi instantaneously.

However with 1921x512x512, np.fft.fftn() takes ~ 5 min. Since I guess scipy's implementation is not much faster and considering that on MATLAB FFTs of same-sized grids finish within ~ 5 s, my question is whether there is a method to speed the process up to or almost to MATLAB times. My knowledge about FFTs is limited, but apparently MATLAB uses the FFTW algorithm, which python does not. Any reasonable chance that with some pyFFTW package I get similar times? Also, 1921 seems an unlucky choice, having only 2 prime factors (17, 113), so I assume this also plays a role. On the other hand 512 is a well-suited power of two. Are MATLAB-like times achievable if possible also without padding up with zeros to 2048?

I'm asking because I'll have to use FFTs a lot (to an amount where such differences will be of huge influence!) and in case there is no possibility to reduce computation times in python, I'd have to switch to other, faster implementations.

Yes, there is a chance that using FFTW through the interface pyfftw will reduce your computation time compared to numpy.fft or scipy.fftpack. The performances of these implementations of DFT algorithms can be compared in benchmarks such as this one : some interesting results are reported in Improving FFT performance in Python

I suggest the following code for a test:

import pyfftw
import numpy
import time
import scipy

f = pyfftw.n_byte_align_empty((127,512,512),16, dtype='complex128')
#f = pyfftw.empty_aligned((33,128,128), dtype='complex128', n=16)
f[:] = numpy.random.randn(*f.shape)

# first call requires more time for plan creation
# by default, pyfftw use FFTW_MEASURE for the plan creation, which means that many 3D dft are computed so as to choose the fastest algorithm.
fftf=pyfftw.interfaces.numpy_fft.fftn(f)

#help(pyfftw.interfaces)
tas = time.time()
fftf=pyfftw.interfaces.numpy_fft.fftn(f) # here the plan is applied, nothing else.
tas = time.time()-tas
print "3D FFT, pyfftw:", tas

f = pyfftw.n_byte_align_empty((127,512,512),16, dtype='complex128')
#f = pyfftw.empty_aligned((33,128,128), dtype='complex128', n=16)
f[:] = numpy.random.randn(*f.shape)


tas = time.time()
fftf=numpy.fft.fftn(f)
tas = time.time()-tas
print "3D FFT, numpy:", tas

tas = time.time()
fftf=scipy.fftpack.fftn(f)
tas = time.time()-tas
print "3D FFT, scipy/fftpack:", tas

# first call requires more time for plan creation
# by default, pyfftw use FFTW_MEASURE for the plan creation, which means that many 3D dft are computed so as to choose the fastest algorithm.
f = pyfftw.n_byte_align_empty((128,512,512),16, dtype='complex128')
fftf=pyfftw.interfaces.numpy_fft.fftn(f)

tas = time.time()
fftf=pyfftw.interfaces.numpy_fft.fftn(f) # here the plan is applied, nothing else.
tas = time.time()-tas
print "3D padded FFT, pyfftw:", tas

For a size of 127*512*512, on my modest computer, I got:

3D FFT, pyfftw: 3.94130897522
3D FFT, numpy: 16.0487070084
3D FFT, scipy/fftpack: 19.001199007
3D padded FFT, pyfftw: 2.55221295357

So pyfftw is significantly faster than numpy.fft and scipy.fftpack. Using padding is even faster, but the thing that is computed is different.

Lastly, pyfftw may seem slower at the first run due to the fact that it uses the flag FFTW_MEASURE according to the documentation. It's a good thing if and only if many DFTs of the same size are successively computed.