Compare preceding two rows with subsequent two row

2020-05-06 02:03发布

I had a question earlier which is deleted and now modified to a less verbose form for you to read easily.

I have a dataframe as given below

df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] , 'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['fake_flag'] = ''

I would like to fill values in column fake_flag based on the below rules

1) if preceding two rows are constant (ex:5,5) or decreasing (7,5), then pick the highest of the two rows. In this case, it is 7 from (7,5) and 5 from (5,5)

2) Check whether the current row is greater than the output from rule 1 by 3 or more points (>=3) and it repeats in another (next) row (2 occurrences of same value). It can be 8/gt 8(if rule 1 output is 5). ex: (8 in row n,8 in row n+1 or 10 in row n,10 in row n+1) If yes, then key in fake VAC in the fake_flag column

This is what I tried

for i in t1.index:
if i >=2:
    print("current value is  ", t1[i])
    print("preceding 1st (n-1) ", t1[i-1])
    print("preceding 2nd (n-2) ", t1[i-2])
    if (t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]): # rule 1 check
        r1_output = t1[i-2] # we get the max of these two values (t1[i-2]), it doesn't matter when it's constant(t1[i-2] or t1[i-1]) will have the same value anyway
        print("rule 1 output is ", r1_output)
        if t1[i] >= r1_output + 3:
            print("found a value for rule 2", t1[i])
            print("check for next value is same as current value", t1[i+1])
            if (t1[i]==t1[i+1]): # rule 2 check
                print("fake flag is being set")
                df['fake_flag'][i] = 'fake_vac'

This check should happen for all records (one by one) for each subject_id. I have a dataset which has million records. Any efficient and elegant solution is helpful. I can't run a loop over million records.

I expect my output to be like as shown below

subject_id = 1

enter image description here

subject_id = 2

enter image description here

1条回答
够拽才男人
2楼-- · 2020-05-06 02:32
import pandas as pd
df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] , 'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})

df['shift1']=df['PEEP'].shift(1)
df['shift2']=df['PEEP'].shift(2)

df['fake_flag'] = np.where((df['shift1'] ==df['shift2']) | (df['shift1'] < df['shift2']), 'fake VAC', '')
df.drop(['shift1','shift2'],axis=1)

Output

0   1   1   7   
1   1   2   5   
2   1   3   10  fake VAC
3   1   4   10  
4   1   5   11  fake VAC
5   1   6   11  
6   1   7   14  fake VAC
7   1   8   14  
8   1   9   17  fake VAC
9   1   10  17  
10  1   11  21  fake VAC
11  1   12  21  
12  1   13  23  fake VAC
13  1   14  23  
14  1   15  25  fake VAC
15  1   16  25  
16  1   17  22  fake VAC
17  1   18  20  fake VAC
18  1   19  26  fake VAC
19  1   20  26  
20  2   1   5   fake VAC
21  2   2   7   fake VAC
22  2   3   8   
23  2   4   8   
24  2   5   9   fake VAC
25  2   6   9   
26  2   7   13  fake VAC
查看更多
登录 后发表回答