I'm interested in converting a numpy array into a sparse dictionary as quickly as possible. Let me elaborate:
Given the array:
numpy.array([12,0,0,0,3,0,0,1])
I wish to produce the dictionary:
{0:12, 4:3, 7:1}
As you can see, we are simply converting the sequence type into an explicit mapping from indices that are nonzero to their values.
In order to make this a bit more interesting, I offer the following test harness to try out alternatives:
from timeit import Timer
if __name__ == "__main__":
s = "import numpy; from itertools import izip; from numpy import nonzero, flatnonzero; vector = numpy.random.poisson(0.1, size=10000);"
ms = [ "f = flatnonzero(vector); dict( zip( f, vector[f] ) )"
, "f = flatnonzero(vector); dict( izip( f, vector[f] ) )"
, "f = nonzero(vector); dict( izip( f[0], vector[f] ) )"
, "n = vector > 0; i = numpy.arange(len(vector))[n]; v = vector[n]; dict(izip(i,v))"
, "i = flatnonzero(vector); v = vector[vector > 0]; dict(izip(i,v))"
, "dict( zip( flatnonzero(vector), vector[flatnonzero(vector)] ) )"
, "dict( zip( flatnonzero(vector), vector[nonzero(vector)] ) )"
, "dict( (i, x) for i,x in enumerate(vector) if x > 0);"
]
for m in ms:
print " %.2fs" % Timer(m, s).timeit(1000), m
I'm using a poisson distribution to simulate the sort of arrays I am interested in converting.
Here are my results so far:
0.78s f = flatnonzero(vector); dict( zip( f, vector[f] ) )
0.73s f = flatnonzero(vector); dict( izip( f, vector[f] ) )
0.71s f = nonzero(vector); dict( izip( f[0], vector[f] ) )
0.67s n = vector > 0; i = numpy.arange(len(vector))[n]; v = vector[n]; dict(izip(i,v))
0.81s i = flatnonzero(vector); v = vector[vector > 0]; dict(izip(i,v))
1.01s dict( zip( flatnonzero(vector), vector[flatnonzero(vector)] ) )
1.03s dict( zip( flatnonzero(vector), vector[nonzero(vector)] ) )
4.90s dict( (i, x) for i,x in enumerate(vector) if x > 0);
As you can see, the fastest solution I have found is
n = vector > 0;
i = numpy.arange(len(vector))[n]
v = vector[n]
dict(izip(i,v))
Any faster way?
Edit: The step
i = numpy.arange(len(vector))[n]
Seems particularly clumsy- generating an entire array before selecting only some elements, particularly when we know it might only be around 1/10 of the elements getting selected. I think this might still be improved.
The following seems to be a significant improvement:
Timing:
I also tried
scipy.sparse.dok_matrix
andpandas.DataFrame.to_dict
but in my testing they were slower than the original.You could use
np.unique
withreturn_index=True
:It's generally faster to iterate over
list
s thannumpy.array
s so you could be faster when you convert them to lists withtolist
:Timing
For short arrays this approach may be slower but it's according to my timings faster than the other approaches for big arrays:
use the sparse matrix in scipy as bridge:
Tried this?
from numpy import where
i = where(vector > 0)[0]