I have many large lists of integers (each with more than 35,000,000 entries) that contain duplicates, and I need a count for each distinct integer in a list. The following code works, but it seems slow. Can anyone better the benchmark using Python, preferably numpy?
def group():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)
which returns:
$ python bench.py
111.377498865
Cheers!
Edit based on responses:
def group_original():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

def group_gnibbler():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,sum(1 for i in g)) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

def group_christophe():  # Erroneous result!
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    counts = values.searchsorted(values, side='right') - values.searchsorted(values, side='left')
    index = np.zeros(len(values),dtype='u4,u2')
    index['f0'] = values
    index['f1'] = counts

def group_paul():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    # nonzero wherever a value differs from its predecessor, i.e. at group starts
    diff = np.concatenate(([1],np.diff(values)))
    # start index of each group, plus a sentinel for the end of the array
    idx = np.concatenate((np.where(diff)[0],[len(values)]))
    index = np.empty(len(idx)-1,dtype='u4,u2')
    index['f0'] = values[idx[:-1]]  # the distinct values
    index['f1'] = np.diff(idx)      # their counts

if __name__=='__main__':
    from timeit import Timer
    timings = [
        ("group_original","Original"),
        ("group_gnibbler","Gnibbler"),
        ("group_christophe","Christophe"),
        ("group_paul","Paul"),
    ]
    for method,title in timings:
        t = Timer("%s()"%method,"from __main__ import %s"%method)
        print "%s: %s secs"%(title,t.timeit(number=1))
which returns:
$ python bench.py
Original: 113.385262966 secs
Gnibbler: 71.7464978695 secs
Christophe: 27.1690568924 secs
Paul: 9.06268405914 secs
Note that Christophe's version currently gives incorrect results.
I guess the most obvious approach, still not mentioned, is to simply use collections.Counter. Instead of building a huge number of temporary lists with groupby, it just counts up the integers. It's a one-liner and gives a 2-fold speedup, but it is still slower than the pure numpy solutions. On my machine it goes from 136 s to 62 s compared to the initial solution.
More than 5 years have passed since Paul's answer was accepted. Interestingly, the sort() is still the bottleneck in the accepted solution. The accepted solution runs in 4.0 s on my machine; with radix sort, that drops to 1.7 s. Just by switching to radix sort, I get an overall 2.35x speedup, and the radix sort is more than 4x faster than quicksort in this case. See How to sort an array of integers faster than quicksort?, which was motivated by your question.
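The implementation details are in the linked question; as a rough sketch of the idea (my own illustration built on numpy's stable argsort, which requires numpy >= 1.15, not the code from the linked answer):

    import numpy as np

    def radix_sort(values, bits_per_pass=8):
        # LSD radix sort for unsigned 32-bit integers: stable-sort by
        # successive 8-bit digits, least significant first; after the
        # last pass the whole array is sorted.
        for shift in range(0, 32, bits_per_pass):
            digits = (values >> shift) & ((1 << bits_per_pass) - 1)
            values = values[np.argsort(digits, kind='stable')]
        return values

In group_paul above, values.sort() would then be replaced by values = radix_sort(values).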
In the latest version of numpy, we have this.
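This presumably refers to np.unique with its return_counts flag (available since numpy 1.9), which does the sorting and the counting in a single call:

    import numpy as np

    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    unique_values, counts = np.unique(values, return_counts=True)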
This is a fairly old thread, but I thought I'd mention that there's a small improvement to be made on the currently-accepted solution. It tested as about half a second faster on my machine; not a huge improvement, but worth something. Additionally, I think it's clearer what's happening; the two-step diff approach is a bit opaque at first glance.
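A sketch of that kind of simplification (my reconstruction, since the code isn't shown above): locate the group starts by comparing neighbouring elements directly, avoiding the intermediate diff over the values.

    import numpy as np

    def group_simplified():
        values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
        values.sort()
        # positions where the sorted values change, i.e. where a new group starts
        boundaries = np.where(values[1:] != values[:-1])[0] + 1
        idx = np.concatenate(([0], boundaries, [len(values)]))
        index = np.empty(len(idx)-1, dtype='u4,u2')
        index['f0'] = values[idx[:-1]]
        index['f1'] = np.diff(idx)
        return index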
I get a 3x improvement doing something like this:
You could try the following (ab)use of scipy.sparse:
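The code is not shown above; a plausible reconstruction of the trick (my sketch, relying on the fact that coo_matrix sums duplicate coordinates, which amounts to counting occurrences):

    import numpy as np
    from scipy import sparse

    def group_sparse():
        values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
        # One matrix entry per value; the indices have to be cast to a signed
        # type because scipy's sparse matrices reject unsigned index arrays.
        rows = values.astype('i8')
        cols = np.zeros(len(values), dtype='i8')
        data = np.ones(len(values), dtype='u4')
        M = sparse.coo_matrix((data, (rows, cols)), shape=(1<<32, 1))
        M.sum_duplicates()  # sorts coordinates and sums repeated entries in place
        index = np.empty(len(M.data), dtype='u4,u2')
        index['f0'] = M.row
        index['f1'] = M.data
        return index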
This is slower than the winning answer, perhaps because scipy currently doesn't support unsigned integers as index types...