group by on scipy sparse matrix

2019-09-11 15:25发布

I have a scipy sparse matrix with 10e6 rows and 10e3 columns, populated to 1%. I also have an array of size 10e6 which contains keys corresponding to the 10e6 rows of my sparse matrix. I want to group my sparse matrix following these keys and aggregate with a sum function.

Example:

Keys:
['foo','bar','foo','baz','baz','bar']

Sparse matrix:
(0,1) 3              -> corresponds to the first 'foo' key
(0,10) 4             -> corresponds to the first 'bar' key
(2,1) 1              -> corresponds to the second 'foo' key
(1,3) 2              -> corresponds to the first 'baz' key
(2,3) 10             -> corresponds to the second 'baz' key
(2,4) 1              -> corresponds to the second 'bar' key

Expected result:
{
    'foo': {1: 4},               -> 4 = 3 + 1
    'bar': {4: 1, 10: 4},        
    'baz': {3: 12}               -> 12 = 2 + 10
}

What is the more efficient way to do it?

I already tried to use pandas.SparseSeries.from_coo on my sparse matrix in order to be able to use pandas group by but I get this known bug:

site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    863         for obj in objs:
    864             if not isinstance(obj, NDFrame):
--> 865                 raise TypeError("cannot concatenate a non-NDFrame object")
    866 
    867             # consolidate

TypeError: cannot concatenate a non-NDFrame object

1条回答
在下西门庆
2楼-- · 2019-09-11 15:51

I can generate your target with basic dictionary and list operations:

keys = ['foo','bar','foo','baz','baz','bar']
rows = [0,0,2,1,2,2]; cols=[1,10,1,3,3,4]; data=[3,4,1,2,10,1]
dd = {}
for i,k in enumerate(keys):
    d1 = dd.get(k, {})
    v = d1.get(cols[i], 0)
    d1[cols[i]] = v + data[i]
    dd[k] = d1
print dd

producing

{'baz': {3: 12}, 'foo': {1: 4}, 'bar': {10: 4, 4: 1}}

I can generate a sparse matrix from this data as well with:

import numpy as np
from scipy import sparse
M = sparse.coo_matrix((data,(rows,cols)))
print M
print
Md = M.todok()
print Md

But notice that the order of terms is not fixed. In the coo the order is as entered, but change format and the order changes. In other words the match between keys and the elements of the sparse matrix is unspecified.

  (0, 1)    3
  (0, 10)   4
  (2, 1)    1
  (1, 3)    2
  (2, 3)    10
  (2, 4)    1

  (0, 1)    3
  (1, 3)    2
  (2, 1)    1
  (2, 3)    10
  (0, 10)   4
  (2, 4)    1

Until you clear up this mapping, the initial dictionary approach is best.

查看更多
登录 后发表回答