我有一个熊猫数据帧,我需要计算在某个窗口内的一列值的总和。 因此,举例来说,如果我有500一扇窗户,我的初始值是1000,我想总结的是499和999之间,并且还1001和1501之间的所有值。
这是更容易一些数据来说明:
chrom pos end AFR EUR pi
0 1 10177 10177 0.4909 0.4056 0.495988
1 1 10352 10352 0.4788 0.4264 0.496369
2 1 10617 10617 0.9894 0.9940 0.017083
3 1 11008 11008 0.1346 0.0885 0.203142
4 1 11012 11012 0.1346 0.0885 0.203142
5 1 13110 13110 0.0053 0.0567 0.053532
6 1 13116 13116 0.0295 0.1869 0.176091
7 1 13118 13118 0.0295 0.1869 0.176091
8 1 13273 13273 0.0204 0.1471 0.139066
9 1 13550 13550 0.0008 0.0080 0.007795
10 1 14464 14464 0.0144 0.1859 0.161422
11 1 14599 14599 0.1210 0.1610 0.238427
12 1 14604 14604 0.1210 0.1610 0.238427
13 1 14930 14930 0.4811 0.5209 0.500209
14 1 14933 14933 0.0015 0.0507 0.044505
15 1 15211 15211 0.5371 0.7316 0.470848
16 1 15585 15585 0.0008 0.0020 0.002635
17 1 15644 15644 0.0008 0.0080 0.007795
18 1 15777 15777 0.0159 0.0149 0.030470
19 1 15820 15820 0.4849 0.2714 0.477153
20 1 15903 15903 0.0431 0.4652 0.349452
21 1 16071 16071 0.0091 0.0010 0.011142
22 1 16142 16142 0.0053 0.0020 0.007721
23 1 16949 16949 0.0227 0.0159 0.038759
24 1 18643 18643 0.0023 0.0080 0.009485
25 1 18849 18849 0.8411 0.9911 0.170532
26 2 30923 30923 0.6687 0.9364 0.338400
27 2 20286 46286 0.0053 0.0010 0.006863
28 2 21698 46698 0.0015 0.0010 0.002566
29 2 42159 47159 0.0083 0.0696 0.067187
所以,我需要根据前两列的子集。 例如,如果我的窗口= 500,我的CHROM = 1和我的POS = 15500,我将需要子集我的DF,以仅包括具有CHROM = 1和15000> POS <16000的那些行。
那么我想总结数据的这个子集的AFR列。
下面是我所做的功能:
#vdf is my main dataframe,
#polyChrom is the chromosome to subset by,
#polyPos is the position to subset by.
#Distance is how far the window should be from the polyPos.
#windowSize is the size of the window itself
#E.g. if distance=20000 and windowSize= 500, we are looking at a window
#that is (polyPos-20000)-500 to (polyPos-20000) and a window that is
#(polyPos+20000) to (polyPos+20000)+500.
def mafWindow(vdf, polyChrom, polyPos, distance, windowSize):
#If start position becomes less than 0, set it to 0
if(polyPos - distance < 0):
start1 = 0
end1 = windowSize
else:
start1 = polyPos - distance
end1 = start1 + windowSize
end2 = polyPos + distance
start2 = end2 - windowSize
#subset df
df = vdf.loc[(vdf['chrom'] == polyChrom) & ((vdf['pos'] <= end1) & (vdf['pos'] >= start1))|
((vdf['pos'] <= end2) & (vdf['pos'] >= start2))].copy()
return(df.AFR.sum())
这整个方法适用于子集划分的数据框,是当我的数据帧中包含〜55K行非常缓慢。 是否有这样做的更快和更有效的方式?