I have one dataframe as below. At first,they have three columns('date','time','flag'). I want to add one column which based on the flag and date which means when I get flag=1 ,then the rest of this day the target is 1, otherwise the target is zero.
date time flag target
0 2017/4/10 10:00:00 0 0
1 2017/4/10 11:00:00 1 1
2 2017/4/10 12:00:00 0 1
3 2017/4/10 13:00:00 0 1
4 2017/4/10 14:00:00 0 1
5 2017/4/11 10:00:00 1 1
6 2017/4/11 11:00:00 0 1
7 2017/4/11 12:00:00 1 1
8 2017/4/11 13:00:00 1 1
9 2017/4/11 14:00:00 0 1
10 2017/4/12 10:00:00 0 0
11 2017/4/12 11:00:00 0 0
12 2017/4/12 12:00:00 0 0
13 2017/4/12 13:00:00 0 0
14 2017/4/12 14:00:00 0 0
15 2017/4/13 10:00:00 0 0
16 2017/4/13 11:00:00 1 1
17 2017/4/13 12:00:00 0 1
18 2017/4/13 13:00:00 1 1
19 2017/4/13 14:00:00 0 1
Use DataFrameGroupBy.cumsum
for cumulative sum flag
values, compare with 0
and last cast mask to integer
:
df['new'] = (df.groupby('date')['flag'].cumsum() > 0).astype(int)
print (df)
date time flag target new
0 2017/4/10 10:00:00 0 0 0
1 2017/4/10 11:00:00 1 1 1
2 2017/4/10 12:00:00 0 1 1
3 2017/4/10 13:00:00 0 1 1
4 2017/4/10 14:00:00 0 1 1
5 2017/4/11 10:00:00 1 1 1
6 2017/4/11 11:00:00 0 1 1
7 2017/4/11 12:00:00 1 1 1
8 2017/4/11 13:00:00 1 1 1
9 2017/4/11 14:00:00 0 1 1
10 2017/4/12 10:00:00 0 0 0
11 2017/4/12 11:00:00 0 0 0
12 2017/4/12 12:00:00 0 0 0
13 2017/4/12 13:00:00 0 0 0
14 2017/4/12 14:00:00 0 0 0
15 2017/4/13 10:00:00 0 0 0
16 2017/4/13 11:00:00 1 1 1
17 2017/4/13 12:00:00 0 1 1
18 2017/4/13 13:00:00 1 1 1
19 2017/4/13 14:00:00 0 1 1
Okay, I know that we've already found a solution here but just to satisfy the nerd in me, here's an answer (not elegant given how long it is) to avoid that nagging first-row flaw
pd.merge(df, (df.groupby('date')['flag'].any().astype(int)).to_frame().T.transpose().reset_index(), left_on='date', right_on='date')
Approach remains the same as @jezrael - the groupby function is key here. Instead of using the cumsum, which leads to the first-row flaw, any()
appears to fit really well into this solution. The only drawback is that it produces a series, which we then need to coerce back into a dataframe and transpose before joining them together by the date key.