I have a scipy sparse matrix with 10e6 rows and 10e3 columns, populated to 1%. I also have an array of size 10e6 which contains keys corresponding to the 10e6 rows of my sparse matrix. I want to group my sparse matrix following these keys and aggregate with a sum function.
Example:
Keys:
['foo','bar','foo','baz','baz','bar']
Sparse matrix:
(0,1) 3 -> corresponds to the first 'foo' key
(0,10) 4 -> corresponds to the first 'bar' key
(2,1) 1 -> corresponds to the second 'foo' key
(1,3) 2 -> corresponds to the first 'baz' key
(2,3) 10 -> corresponds to the second 'baz' key
(2,4) 1 -> corresponds to the second 'bar' key
Expected result:
{
'foo': {1: 4}, -> 4 = 3 + 1
'bar': {4: 1, 10: 4},
'baz': {3: 12} -> 12 = 2 + 10
}
What is the most efficient way to do this?
I already tried pandas.SparseSeries.from_coo on my sparse matrix so that I could use pandas' groupby, but I hit this known bug:
site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
863 for obj in objs:
864 if not isinstance(obj, NDFrame):
--> 865 raise TypeError("cannot concatenate a non-NDFrame object")
866
867 # consolidate
TypeError: cannot concatenate a non-NDFrame object
I can generate your target with basic dictionary and list operations:
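A minimal sketch of such dictionary operations, assuming the i-th listed entry belongs to the i-th key (the pairing that the question's annotations imply). The names `entries` and `result` are illustrative only:

```python
from collections import defaultdict

keys = ['foo', 'bar', 'foo', 'baz', 'baz', 'bar']
# (row, col, value) triples, in the order they are listed in the question
entries = [(0, 1, 3), (0, 10, 4), (2, 1, 1), (1, 3, 2), (2, 3, 10), (2, 4, 1)]

# Assumption: the i-th entry is paired with the i-th key, regardless of its row.
result = defaultdict(lambda: defaultdict(int))
for key, (_row, col, value) in zip(keys, entries):
    result[key][col] += value  # sum values per key and column

# Convert the nested defaultdicts to plain dicts for display
result = {key: dict(cols) for key, cols in result.items()}
print(result)
# {'foo': {1: 4}, 'bar': {10: 4, 4: 1}, 'baz': {3: 12}}
```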
producing
I can generate a sparse matrix from this data as well with:
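One way this could look, as a sketch only: the coordinates are copied from the question, while the shape `(3, 11)` and the variable names are assumptions made for illustration. It also prints the stored values before and after a format conversion, so the ordering can be compared:

```python
import numpy as np
from scipy import sparse

# Coordinates and values in the order they are listed in the question
rows = np.array([0, 0, 2, 1, 2, 2])
cols = np.array([1, 10, 1, 3, 3, 4])
vals = np.array([3, 4, 1, 2, 10, 1])

# Assumed shape: just large enough to hold the example entries
M = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 11))
print(M.data)   # coo keeps the entries in the order they were entered

Mr = M.tocsr()
print(Mr.data)  # csr stores the entries grouped by row, so the order differs
```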
But notice that the order of terms is not fixed. In the coo format the order is as entered, but convert to another format and the order changes. In other words, the match between keys and the elements of the sparse matrix is unspecified.
Until you clear up this mapping, the initial dictionary approach is best.
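If the mapping turns out to be one key per matrix row, as the first paragraph of the question describes, a grouped sum can stay entirely sparse. The sketch below rests on that assumption and uses a small made-up matrix `X` (not the question's example, whose rows do not line up with its keys). It builds a hypothetical indicator matrix `G` with one row per distinct key and multiplies it by `X`:

```python
import numpy as np
from scipy import sparse

# Assumption: exactly one key per row of X (made-up data for illustration)
row_keys = np.array(['foo', 'bar', 'foo', 'baz', 'baz', 'bar'])
X = sparse.csr_matrix(np.array([
    [0, 3, 0, 0],
    [0, 0, 0, 4],
    [0, 1, 0, 0],
    [2, 0, 0, 0],
    [10, 0, 0, 0],
    [0, 0, 1, 0],
]))

# Map each key to an integer group code; groups come back sorted
groups, codes = np.unique(row_keys, return_inverse=True)

# G[g, i] = 1 when row i of X belongs to group g
n_rows = X.shape[0]
G = sparse.csr_matrix(
    (np.ones(n_rows, dtype=X.dtype), (codes, np.arange(n_rows))),
    shape=(len(groups), n_rows),
)

# Row g of the product is the sum of all rows of X in group g
summed = G @ X
print(groups)
print(summed.toarray())
```

Each row of `summed` lines up with the corresponding entry of `groups`, so a nested dictionary like the one in the question could be read off row by row if that output format is required.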