So if I have something like x=np.random.rand(60000)*400-200
. iPython's %timeit
says:
x.astype(int)
takes 0.14msnp.rint(x)
andnp.around(x)
take 1.01ms
Note that in the rint
and around
cases you still need to spend the extra 0.14ms to do a final astype(int)
(assuming that's what you ultimately want).
Question: am I right in thinking that most modern hardware is capable of doing both operations in equal time. If so, why is numpy taking 8 times longer for the rounding?
As it happens I'm not super fussy about the exactness of the arithmetic, but I can't see how to take advantage of that with numpy (I'm doing messy biology not particle physics).
As pointed out by @jme in the comments, the
rint
andaround
functions must work out whether to round the fractions up or down to the nearest integer. In contrast, theastype
function will always round down so it can immediately discard the decimal information. There are a number of other functions that do the same thing. Also, you could improve the speed by using a lower number of bits for the integer. However you must be careful that you can accommodate the full range of your input data.Note, this does not store values outside the range -128 to 127 as it's 8-bit. Some values in your example fall outside this range.
Of all the others I tried,
np.intc
seems to be the fastest:Your examples, on my machine:
np.around(x).astype(int)
andx.astype(int)
don't produce the same values. The former rounds even (it's the same as((x*x>=0+0.5) + (x*x<0-0.5)).astype(int)
) whereas the latter rounds towards zero. However,shows
y==z
but calculatingy
is much slower. So it's thenp.trunc
andnp.around
functions which are slow.So
np.trunc(x)
rounds towards zero from double to double. Thenastype(int)
has to convert double to int64.Internally I don't know what python or numpy are doing but I know how I would do this in C. Let's discuss some hardware. With SSE4.1 it's possible to do round, floor, ceil, and trunc from double to double using:
but numpy needs to support systems without SSE4.1 as well so it would have to build without SSE4.1 as well as with SSE4.1 and then use a dispatcher.
But to do this from double directly to int64 using SSE/AVX is not efficient until AVX512. However, it is possible to round double to int32 efficiently using only SSE2:
These converts two doubles to two int64.
In your case this would work fine since the range is certainly within int32. But unless python knows the range fits in int32 it can't assume this so it would have to round or trunc to int64 which is slow. Also, once again numpy would have to build to support SSE2 to do this anyway.
But maybe you could have used a single floating point array to begin with. In that case you could have done:
These convert four singles to four int32.
So to answer your question SSE2 can round or truncated from double to int32 efficiently. AVX512 will be able to round or truncated from double to int64 efficiently as well using
_mm512_cvtpd_epi64(a)
or_mm512_cvttpd_epi64(a)
. SSE4.1 can round/trunc/floor/ceil from float to float or double to double efficiently.