I would like to understand a strange behavior of python.
Let us consider a matrix M
with shape 6000 x 2000
. This matrix is filled with signed integers. I want to compute np.transpose(M)*M
. Two options:
- When I do it "naturally" (i.e. without specifying any typing), numpy selects the type
np.int32
and the operation takes around 150s. - When I force the type to be
np.float64
(usingdtype=...
), the same operation takes around 2s.
How can we explain this behavior ? I was naively thinking that a int multiplication was cheaper than a float multiplication.
Thanks a lot for your help.
Some supplemental information that I found through experimentation.
This can be circumvented. Timings are on a intel cpu with intel mkl for BLAS. Im also using fortran ordered arrays to keep everything equivalent a the
scipy.linalg.blas
is the fortran BLAS.Lets take the following example:
First lets take the DGEMM calls.
Calling DGEMM/SGEMM directly:
Looks like something strange going on in the
np.dot
call. Similar to naive algorithm speed:No, integer multiplies aren't cheaper. But more on that later. Most likely (I am 99% sure)
numpy
callsBLAS
routine under blankets, which can be as efficient as 90% of peak CPU performance. There aren't special provisions forint
matrix multiplies, most likely it is done in Python rather than machine-compiled version - I am actually wrong on this, see below.With regards to
int
vsfloat
speed: on most architectures (Intel) they are roughly the same, around 3-5 cycles or so per instruction, both have serial (X87) and vector (XMM) version. On Sandy bridge,PMUL***
(integer vector multiply) is 5 cycles and so are theMULP*
(floating multiplies). With Sandy Bridge you also have 256-bit SIMD vectors ops (YMM) - you get 8float
ops per instructions - I am not sure if there is anint
counterpart.This here is a great reference: http://www.agner.org/optimize/instruction_tables.pdf
That said, instruction latencies don't explain 75X speed difference. It is probably a combination of optimized BLAS (threaded probably) and int32 being handled in Python rather than C/Fortran.
I profiled following snippet:
and this is what oprofile says:
However the libblas is unoptimized serial Netlib Blas. With a good BLAS implementation that 47% will be much lower, especially if it is threaded.
Edit: It seems numpy does provide compiled version of integer matrix multiply.