Fastest way to extract dictionary of sums in numpy

Published 2019-04-26 19:09

Let's say I have an array like:

arr = np.array([[1,20,5],
                [1,20,8],
                [3,10,4],
                [2,30,6],
                [3,10,5]])

and I would like to form a dictionary of the sums of the third column, grouped by each value in the first column, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique elements in the first column.

Approach 1: Naively, I can invoke np.unique(), then iterate through each item of the unique array, combining np.where() and np.sum() inside a dict comprehension. This would be reasonably fast with a small number of unique elements, but at 100k unique elements I would incur a lot of wasted page fetches, making 100k I/O passes over the entire array.
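For scale, here's a minimal sketch of Approach 1 (the variable names are mine):

import numpy as np

# One full scan of arr per unique key: ~100k passes over a billion rows.
sums = {k: arr[np.where(arr[:, 0] == k)[0], 2].sum()
        for k in np.unique(arr[:, 0])}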

Approach 2: I could instead make a single I/O pass over the array, accumulating sums keyed by column 1 as I go (hashing column 1 at each row will probably be cheaper than the excessive page fetches), but then I lose the advantage of numpy's vectorized C inner loop.
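And a sketch of Approach 2 as the pure-Python loop I would like to avoid (defaultdict is my choice of accumulator, not part of the question):

from collections import defaultdict

sums = defaultdict(int)
for key, val in zip(arr[:, 0], arr[:, 2]):  # a single pass, but at interpreter speed
    sums[key] += val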

Is there a fast way to implement Approach 2 without resorting to a pure Python loop?

4 Answers
家丑人穷心不美
#2 · 2019-04-26 19:14

Here's a NumPy-based approach using np.add.reduceat -

sidx = arr[:,0].argsort()                                  # order rows by the key column
idx = np.append(0,np.where(np.diff(arr[sidx,0])!=0)[0]+1)  # start index of each key group
keys = arr[sidx[idx],0]                                    # one representative key per group
vals = np.add.reduceat(arr[sidx,2],idx,axis=0)             # sum of column 2 within each group

If you would like to get the keys and values in a 2-column array -

out = np.column_stack((keys,vals))

Sample run -

In [351]: arr
Out[351]: 
array([[ 1, 20,  5],
       [ 1, 20,  8],
       [ 3, 10,  4],
       [ 2, 30,  6],
       [ 3, 10,  5]])

In [352]: out
Out[352]: 
array([[ 1, 13],
       [ 2,  6],
       [ 3,  9]])
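To finish with the dictionary the question asks for, the two arrays zip together directly:

out_dict = dict(zip(keys, vals))  # {1: 13, 2: 6, 3: 9}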
一夜七次
#3 · 2019-04-26 19:15

The proper way of doing this with NumPy is np.bincount. If your unique first-column labels are already small contiguous integers, you can simply do:

cum_sums = np.bincount(arr[:, 0], weights=arr[:, 2])  # index i holds the sum for label i
cum_dict = {index: cum_sum for index, cum_sum in enumerate(cum_sums)
            if cum_sum != 0}                          # skip labels that never occur

The cum_sum != 0 test is an attempt to skip missing first-column labels, but it may be grossly wrong if your third column includes negative numbers, since a label that does occur can legitimately sum to zero.
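On the sample array, this first variant gives (note that np.bincount returns float64 whenever weights are passed):

cum_sums  # array([ 0., 13.,  6.,  9.]) -- index 0 is a label that never occurs
cum_dict  # {1: 13.0, 2: 6.0, 3: 9.0}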

Alternatively, you can do things properly and call np.unique first:

uniques, indices = np.unique(arr[:, 0], return_inverse=True)  # map labels to compact ids 0..k-1
cum_sums = np.bincount(indices, weights=arr[:, 2])            # one sum per compact id
cum_dict = {unique: cum_sum for unique, cum_sum in zip(uniques, cum_sums)}
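Since the weighted bincount yields floats here as well, a small variation of my own casts back to integers if that is what you need:

cum_dict = {int(u): int(s) for u, s in zip(uniques, cum_sums)}  # {1: 13, 2: 6, 3: 9}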
等我变得足够好
#4 · 2019-04-26 19:21

This is a typical grouping problem, which the numpy_indexed package solves efficiently and elegantly (if I may say so myself; I am its author):

import numpy_indexed as npi
npi.group_by(arr[:, 0]).sum(arr[:, 2])

It's a much more lightweight solution than the pandas package, and I think the syntax is cleaner, as there is no need to create a special data structure just to perform this type of elementary operation. Performance should be identical to the solution posed by Divakar, since it follows the same steps, just with a nice, tested interface on top.
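The reduction returns the unique keys alongside the per-group sums, so, assuming that (keys, sums) tuple interface, the dictionary follows the same way as in the other answers:

unique, sums = npi.group_by(arr[:, 0]).sum(arr[:, 2])
result = dict(zip(unique, sums))  # {1: 13, 2: 6, 3: 9}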

chillily
#5 · 2019-04-26 19:32

numpy approach:

u = np.unique(arr[:, 0])                       # unique keys
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)  # broadcast to an (n, k) mask, then sum per key

dict(np.stack([u, s]).T)

{1: 13, 2: 6, 3: 9}

pandas approach:

import pandas as pd
import numpy as np

pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()

{1: 13, 2: 6, 3: 9}

