Fastest way to extract dictionary of sums in numpy

Let's say I have an array like:

arr = np.array([[1,20,5],
                [1,20,8],
                [3,10,4],
                [2,30,6],
                [3,10,5]])

and I would like to form a dictionary of the sum of the third column for each row that matches each value in the first column, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there's 1 billion rows in my array and 100k unique elements in the first column.

Approach 1: Naively, I can invoke np.unique() then iterate through each item in the unique array with a combination of np.where() and np.sum() in a one-liner dictionary enclosing a list comprehension. This would be reasonably fast if I have a small number of unique elements, but at 100k unique elements, I will incur a lot of wasted page fetches making 100k I/O passes of the entire array.

Approach 2: I could make a single I/O pass of the last column (because having to hash column 1 at each row will probably be cheaper than the excessive page fetches) too, but I lose the advantage of numpy's C inner loop vectorization here.

Is there a fast way to implement Approach 2 without resorting to a pure Python loop?

标签： python numpy pandas dictionary vectorization

4条回答

家丑人穷心不美

2楼-- · 2019-04-26 19:14

Here's a NumPy based approach using np.add.reduceat -

sidx = arr[:,0].argsort()
idx = np.append(0,np.where(np.diff(arr[sidx,0])!=0)[0]+1)
keys = arr[sidx[idx],0]
vals = np.add.reduceat(arr[sidx,2],idx,axis=0)

If you would like to get the keys and values in a 2-column array -

out = np.column_stack((keys,vals)) # If you

Sample run -

In [351]: arr
Out[351]: 
array([[ 1, 20,  5],
       [ 1, 20,  8],
       [ 3, 10,  4],
       [ 2, 30,  6],
       [ 3, 10,  5]])

In [352]: out
Out[352]: 
array([[ 1, 13],
       [ 2,  6],
       [ 3,  9]])

0人赞添加讨论(0) 举报

一夜七次

3楼-- · 2019-04-26 19:15

The proper way of doing this is with NumPy is using np.bincount. If your unique first column labels are already small contiguous integers you could simply do:

cum_sums = np.bincount(arr[:, 0], weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in enumerate(cum_sums)
            if cum_sum != 0}

where the cum_sum != 0 is an attempt to skip missing first column labels, which may be grossly wrong if your third column includes negative numbers.

Alternatively you can do things properly and call np.unique first and do:

uniques, indices = np.unique(arr[:, 0], return_inverse=True)
cum_sums = np.bincount(indices, weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in zip(uniques, cum_sums)}

0人赞添加讨论(0) 举报

等我变得足够好

4楼-- · 2019-04-26 19:21

This is a typical grouping problem, which the numpy_indexed package solves efficiently and elegantly (if I may say so myself; I am its author)

import numpy_indexed as npi
npi.group_by(arr[:, 0]).sum(arr[:, 2])

Its a much more lightweight solution than the pandas package, and I think the syntax is cleaner, as there is no need to create a special datastructure just to perform this type of elementary operation. Performance should be identical to the solution posed by Divakar, since it follows the same steps; just with a nice and tested interface on top.

0人赞添加讨论(0) 举报

▲ chillily

5楼-- · 2019-04-26 19:32

numpy approach:

u = np.unique(arr[:, 0])
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)

dict(np.stack([u, s]).T)

{1: 13, 2: 6, 3: 9}

pandas approach:

import pandas as pd
import numpy as np

pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()

{1: 13, 2: 6, 3: 9}

0人赞添加讨论(0) 举报

Fastest way to extract dictionary of sums in numpy

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间