Aggregate all dataframe row pair combinations usin

I use pandas (Python) for grouping and aggregation across data frames, but I would now like to aggregate specific pairs of rows (all "n choose 2" combinations). Here is the example data; I would like to look at all pairs of genes in mygenes:

import pandas
import itertools

mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']

df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
                       'case1'   : [0,1,1,0,0],
                       'case2'   : [1,1,1,0,1],
                       'control1':[0,0,1,1,1],
                       'control2':[1,0,0,1,0] })
>>> df
   Gene  case1  case2  control1  control2
0  ABC1      0      1         0         1
1  ABC2      1      1         0         0
2  ABC3      1      1         1         0
3  ABC4      0      0         1         1
4  ABC5      0      1         1         0

The final product should look like this (applying np.sum by default is fine):

                 case1    case2    control1    control2
'ABC1', 'ABC2'    1         2         0            1
'ABC1', 'ABC3'    1         2         1            1
'ABC1', 'ABC4'    0         1         1            2
'ABC2', 'ABC3'    2         2         1            0
'ABC2', 'ABC4'    1         1         1            1
'ABC3', 'ABC4'    1         1         2            1

The set of gene pairs can be easily obtained with itertools (itertools.combinations(mygenes, 2)), but I can't figure out how to aggregate specific rows based on their values. Can anyone advise? Thank you

Tags: python pandas aggregate combinations itertools

2 Answers

孤傲高冷的网名

Reply #2 · 2019-05-06 21:57

I can't think of a clever vectorized way to do this, but unless performance is a real bottleneck I tend to use the simplest thing which makes sense. In this case, I might set_index("Gene") and then use loc to pick out the rows:

>>> import pandas as pd
>>> from itertools import combinations
>>> df = df.set_index("Gene")
>>> cc = list(combinations(mygenes, 2))
>>> out = pd.DataFrame([df.loc[list(c)].sum() for c in cc], index=cc)
>>> out
              case1  case2  control1  control2
(ABC1, ABC2)      1      2         0         1
(ABC1, ABC3)      1      2         1         1
(ABC1, ABC4)      0      1         1         2
(ABC2, ABC3)      2      2         1         0
(ABC2, ABC4)      1      1         1         1
(ABC3, ABC4)      1      1         2         1
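
Since the question says np.sum is only the default, here is a minimal sketch of the same loc-based idea wrapped in a function that accepts any aggregation name. The helper name pairwise_agg and the index level names gene_a and gene_b are made up for illustration; they are not part of pandas or of the code above.

```python
from itertools import combinations

import pandas as pd

mygenes = ["ABC1", "ABC2", "ABC3", "ABC4"]

df = pd.DataFrame({"Gene": ["ABC1", "ABC2", "ABC3", "ABC4", "ABC5"],
                   "case1": [0, 1, 1, 0, 0],
                   "case2": [1, 1, 1, 0, 1],
                   "control1": [0, 0, 1, 1, 1],
                   "control2": [1, 0, 0, 1, 0]})


def pairwise_agg(frame, genes, func="sum"):
    """Aggregate every 2-gene combination of rows (hypothetical helper)."""
    indexed = frame.set_index("Gene")
    pairs = list(combinations(genes, 2))
    # list(pair) makes .loc select both rows by label; agg collapses them
    rows = [indexed.loc[list(pair)].agg(func) for pair in pairs]
    # Assumed naming for the two index levels
    index = pd.MultiIndex.from_tuples(pairs, names=["gene_a", "gene_b"])
    return pd.DataFrame(rows, index=index)


sums = pairwise_agg(df, mygenes)          # same numbers as the table above
maxes = pairwise_agg(df, mygenes, "max")  # any aggregation name pandas accepts
```

This is still one loc lookup per pair, so it is no faster than the list comprehension; it only makes the aggregation swappable.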


爷、活的狠高调

Reply #3 · 2019-05-06 22:03

Before going too far, keep in mind that the output gets big pretty fast. With n rows there are C(n,2) = n(n-1)/2 pairs; with 5 rows that is C(5,2) = 4+3+2+1 = 10, and the count grows quadratically from there.
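
To get a feel for that growth, a quick sketch that counts the pairs for a few row counts (the sizes are arbitrary, picked only for illustration; math.comb needs Python 3.8 or later):

```python
import math

# Number of unordered row pairs, C(n, 2) = n * (n - 1) / 2, for a few sizes
pair_counts = {n: math.comb(n, 2) for n in (5, 100, 10_000)}

for n, count in pair_counts.items():
    print(f"{n} rows -> {count} pairs")
```

Ten thousand input rows already mean roughly fifty million output rows, which is worth knowing before choosing how to store the result.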

That said, I'd think about doing this in numpy for speed (you may want to add a numpy tag to your question btw). Anyway, this isn't as vectorized as it might be, but ought to be a start at least:

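import math
import numpy as np
import pandas as pd

# Keep only the genes of interest, in the order given by mygenes
df2 = df.set_index('Gene').loc[mygenes].reset_index()

sz = len(df2)
# Number of pairs, C(sz, 2); integer division so it can be used as a shape
sz2 = math.factorial(sz) // (math.factorial(sz - 2) * 2)

gene = df2['Gene'].tolist()
abc = df2.iloc[:, 1:].values  # numeric columns as a 2-D array

arr = np.zeros((sz2, abc.shape[1]), dtype=abc.dtype)
gene2 = []
k = 0

for i in range(sz):
    for j in range(i + 1, sz):  # j > i, so each pair is visited once
        gene2.append(gene[i] + gene[j])
        arr[k] = abc[i] + abc[j]
        k += 1

pd.concat([pd.DataFrame(gene2), pd.DataFrame(arr)], axis=1)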
Out[1780]: 
          0  0  1  2  3
0  ABC1ABC2  1  2  0  1
1  ABC1ABC3  1  2  1  1
2  ABC1ABC4  0  1  1  2
3  ABC2ABC3  2  2  1  0
4  ABC2ABC4  1  1  1  1
5  ABC3ABC4  1  1  2  1

Depending on size and speed requirements, you may need to separate the string code from the numerical code and vectorize the numerical piece. This code is unlikely to scale well if your data is big, and if it is, that may determine what sort of answer you need (you may also need to think about how you store the results).
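
One possible way to vectorize the numerical piece is sketched below. It uses np.triu_indices to get the positions of every row pair and does a single array addition. This is a swapped-in technique rather than the loop above, and the level names gene_a and gene_b are made up for illustration.

```python
import numpy as np
import pandas as pd

mygenes = ["ABC1", "ABC2", "ABC3", "ABC4"]

df = pd.DataFrame({"Gene": ["ABC1", "ABC2", "ABC3", "ABC4", "ABC5"],
                   "case1": [0, 1, 1, 0, 0],
                   "case2": [1, 1, 1, 0, 1],
                   "control1": [0, 0, 1, 1, 1],
                   "control2": [1, 0, 0, 1, 0]})

# Keep only the genes of interest, in the order given by mygenes
sub = df.set_index("Gene").loc[mygenes]
values = sub.to_numpy()
names = sub.index.to_numpy()

# Row positions (i, j) with i < j, so every unordered pair appears once
i, j = np.triu_indices(len(sub), k=1)

# Numerical piece: one vectorized addition over all pairs
pair_sums = values[i] + values[j]

# String piece, kept separate: label each output row with its two genes
labels = pd.MultiIndex.from_arrays([names[i], names[j]],
                                   names=["gene_a", "gene_b"])

result = pd.DataFrame(pair_sums, index=labels, columns=sub.columns)
```

The pairs come out in the same order as itertools.combinations(mygenes, 2). Memory still grows with the number of pairs, so for very large inputs the index arrays may need to be processed in chunks.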
