大熊猫:迅速计算与一定值列的总和(pandas: rapidly calculating sum o

2019-09-25 19:48发布

我有一个熊猫数据帧,我需要计算在某个窗口内的一列值的总和。 因此,举例来说,如果我有500一扇窗户,我的初始值是1000,我想总结的是499和999之间,并且还1001和1501之间的所有值。

这是更容易一些数据来说明:

    chrom   pos end AFR EUR pi
0   1   10177   10177   0.4909  0.4056  0.495988
1   1   10352   10352   0.4788  0.4264  0.496369
2   1   10617   10617   0.9894  0.9940  0.017083
3   1   11008   11008   0.1346  0.0885  0.203142
4   1   11012   11012   0.1346  0.0885  0.203142
5   1   13110   13110   0.0053  0.0567  0.053532
6   1   13116   13116   0.0295  0.1869  0.176091
7   1   13118   13118   0.0295  0.1869  0.176091
8   1   13273   13273   0.0204  0.1471  0.139066
9   1   13550   13550   0.0008  0.0080  0.007795
10  1   14464   14464   0.0144  0.1859  0.161422
11  1   14599   14599   0.1210  0.1610  0.238427
12  1   14604   14604   0.1210  0.1610  0.238427
13  1   14930   14930   0.4811  0.5209  0.500209
14  1   14933   14933   0.0015  0.0507  0.044505
15  1   15211   15211   0.5371  0.7316  0.470848
16  1   15585   15585   0.0008  0.0020  0.002635
17  1   15644   15644   0.0008  0.0080  0.007795
18  1   15777   15777   0.0159  0.0149  0.030470
19  1   15820   15820   0.4849  0.2714  0.477153
20  1   15903   15903   0.0431  0.4652  0.349452
21  1   16071   16071   0.0091  0.0010  0.011142
22  1   16142   16142   0.0053  0.0020  0.007721
23  1   16949   16949   0.0227  0.0159  0.038759
24  1   18643   18643   0.0023  0.0080  0.009485
25  1   18849   18849   0.8411  0.9911  0.170532
26  2   30923   30923   0.6687  0.9364  0.338400
27  2   20286   46286   0.0053  0.0010  0.006863
28  2   21698   46698   0.0015  0.0010  0.002566
29  2   42159   47159   0.0083  0.0696  0.067187

所以,我需要根据前两列的子集。 例如,如果我的窗口= 500,我的CHROM = 1和我的POS = 15500,我将需要子集我的DF,以仅包括具有CHROM = 1和15000> POS <16000的那些行。

那么我想总结数据的这个子集的AFR列。

下面是我所做的功能:

#vdf is my main dataframe, 
#polyChrom is the chromosome to subset by,     
#polyPos is the position to subset by. 
#Distance is how far the window should be from the polyPos. 
#windowSize is the size of the window itself
#E.g. if distance=20000 and windowSize= 500, we are looking at a window
#that is (polyPos-20000)-500 to (polyPos-20000) and a window that is 
#(polyPos+20000) to (polyPos+20000)+500.
def mafWindow(vdf, polyChrom, polyPos, distance, windowSize):

#If start position becomes less than 0, set it to 0
    if(polyPos - distance < 0):
        start1 = 0
        end1 = windowSize
    else:
        start1 = polyPos - distance
        end1 = start1 + windowSize

    end2 = polyPos + distance
    start2 = end2 - windowSize

#subset df        
    df = vdf.loc[(vdf['chrom'] == polyChrom) & ((vdf['pos'] <= end1) & (vdf['pos'] >= start1))|
                 ((vdf['pos'] <= end2) & (vdf['pos'] >= start2))].copy()

    return(df.AFR.sum())

这整个方法适用于子集划分的数据框,是当我的数据帧中包含〜55K行非常缓慢。 是否有这样做的更快和更有效的方式?

Answer 1:

关键是要下降到numpy的阵列。 熊猫索引和切片是缓慢的。

import pandas as pd

df = pd.DataFrame([[1, 10177, 0.5], [1, 10178, 0.2], [1, 20178, 0.1],
                   [2, 10180, 0.3], [1, 10180, 0.4]], columns=['chrom', 'pos', 'AFR'])

chrom = df['chrom'].values
pos = df['pos'].values
afr = df['AFR'].values

def filter_sum(chrom_arr, pos_arr, afr_arr, chrom_val, pos_start, pos_end):
    return sum(k for i, j, k in zip(chrom_arr, pos_arr, afr_arr) \
               if pos_start < j < pos_end and i == chrom_val)

filter_sum(chrom, pos, afr, 1, 10150, 10200)

# 1.1


文章来源: pandas: rapidly calculating sum of column with certain values