I have the following dataframe df:
import pandas as pd
import numpy as np

data = {'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3, 1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1]}
df = pd.DataFrame.from_dict(data)
df
Out[8]:
id value
0 1 2
1 1 2
2 1 3
3 1 2
4 1 2
5 1 2
6 1 3
7 1 3
8 1 3
9 1 3
10 2 1
11 2 4
12 2 1
13 2 1
14 2 1
15 2 4
16 2 4
17 2 1
18 2 1
19 2 1
20 2 1
21 2 1
What I need to do is identify, at the id level (df.groupby('id')), when the value shows the same number consecutively 3 or more times.
I would like to have the following result for the above:
df
Out[12]:
id value flag
0 1 2 0
1 1 2 0
2 1 3 0
3 1 2 1
4 1 2 1
5 1 2 1
6 1 3 1
7 1 3 1
8 1 3 1
9 1 3 1
10 2 1 0
11 2 4 0
12 2 1 1
13 2 1 1
14 2 1 1
15 2 4 0
16 2 4 0
17 2 1 1
18 2 1 1
19 2 1 1
20 2 1 1
21 2 1 1
I have tried variations of groupby and lambda using pandas rolling.mean to identify where the average of the rolling period equals the 'value'; where they match, that indicates a flag. But this has several problems, including that different values can average to the value you are trying to flag. Also, I can't figure out how to flag all of the values that produced the rolling mean that triggered the flag: the rolling mean identifies the 'right side' of the run, but I then need to fill in the previous values covered by the rolling window. See my code here:
test=df.copy()
test['rma']=test.groupby('id')['value'].transform(lambda x: x.rolling(min_periods=3,window=3).mean())
test['flag']=np.where(test.rma==test.value,1,0)
And the result here:
test
Out[61]:
id value rma flag
0 1 2 NaN 0
1 1 2 NaN 0
2 1 3 2.333333 0
3 1 2 2.333333 0
4 1 2 2.333333 0
5 1 2 2.000000 1
6 1 3 2.333333 0
7 1 3 2.666667 0
8 1 3 3.000000 1
9 1 3 3.000000 1
10 2 1 NaN 0
11 2 4 NaN 0
12 2 1 2.000000 0
13 2 1 2.000000 0
14 2 1 1.000000 1
15 2 4 2.000000 0
16 2 4 3.000000 0
17 2 1 3.000000 0
18 2 1 2.000000 0
19 2 1 1.000000 1
20 2 1 1.000000 1
21 2 1 1.000000 1
Can't wait to see what I am missing! Thanks
You can try this:

1) Create an extra group variable with df.value.diff().ne(0).cumsum() to denote the value changes;

2) use transform('size') to calculate the group size and compare with three; then you get the flag column you need.

Breakdown:

1) diff is not equal to zero (which is literally what df.value.diff().ne(0) means) gives a condition that is True whenever there is a value change;

2) cumsum then gives a non-descending sequence of ids, where each id denotes a consecutive chunk of the same value; note that when summing boolean values, True is counted as one while False is counted as zero;

3) combined with the id column, you can group the data frame, calculate the group size and get the flag column.

See EDIT2 for a more robust solution.
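Putting the two steps together, a runnable sketch (the exact method chain is an assumption consistent with the steps described, not necessarily the original one-liner):

```python
import pandas as pd

data = {'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3,
                  1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1]}
df = pd.DataFrame.from_dict(data)

# 1) label each run of consecutive equal values;
# 2) group by (id, run label) and compare each group's size with 3.
df['flag'] = (df.value
                .groupby([df.id, df.value.diff().ne(0).cumsum()])
                .transform('size')
                .ge(3)
                .astype(int))

print(df['flag'].tolist())
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
```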
Same result, but a little bit faster:
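A sketch of this approach, assuming the intermediate series is named labels as in the notes below:

```python
import pandas as pd

data = {'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3,
                  1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1]}
df = pd.DataFrame.from_dict(data)

# Label each run of identical consecutive values, then map each label
# to its run length and flag runs of length 3 or more.
labels = (df.value != df.value.shift()).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

print(df['flag'].tolist())
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
```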
Where:

- df.value != df.value.shift() gives the value changes;
- cumsum() creates "labels" for each group of same value;
- labels.value_counts() counts the occurrences of each label;
- labels.map(...) replaces labels by the counts computed above;
- >= 3 creates a boolean mask on the count values;
- astype(int) casts the booleans to int.

In my hands it gives 1.03ms on your df, compared to 2.1ms for Psidom's approach. But mine is not a one-liner.
EDIT:
A mix between both approaches is even faster
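The code for this edit is not shown; one plausible reading of the mix, combining the diff().ne(0).cumsum() labelling from the first answer with the map/value_counts counting above:

```python
import pandas as pd

data = {'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3,
                  1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1]}
df = pd.DataFrame.from_dict(data)

# diff-based run labels, counted via value_counts rather than groupby.
labels = df.value.diff().ne(0).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)
```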
Gives 911µs with your sample df.
EDIT2: correct solution to account for id changes, as pointed out by @clg4
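A sketch of the corrected version; the elided expression in the note below is assumed to be the value-change test df.value.diff().ne(0):

```python
import pandas as pd

data = {'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3,
                  1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1]}
df = pd.DataFrame.from_dict(data)

# A new run starts where the value changes OR where the id changes,
# so a run can never straddle two ids.
labels = (df.value.diff().ne(0) | df.id.diff().ne(0)).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)
```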
Where ... | df.id.diff().ne(0) increments the label where the id changes. This works even with the same value on an id change (tested with value 3 at index 10) and takes 1.28ms.
EDIT3: Better explanations
Take the case where index 10 has value 3.

df.id.diff().ne(0) is True wherever the id changes. The | is the "bitwise or" operator, which gives True as long as at least one of the elements is True. So if there is no diff in value where the id changes, the | reflects the id change; otherwise it changes nothing. When .cumsum() is performed, the label is incremented where the id changes, so the value 3 at index 10 is not grouped with the values 3 from indexes 6-9.
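To illustrate, a small check with the sample data modified so that index 10 has value 3 (the names naive and robust are for illustration only):

```python
import pandas as pd

data = {'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3,
                  3, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1]}  # index 10 is now 3
df = pd.DataFrame.from_dict(data)

# Labels from the value changes alone: index 10 continues the run of 3s.
naive = df.value.diff().ne(0).cumsum()
# Labels that also break on id changes: index 10 starts a new run.
robust = (df.value.diff().ne(0) | df.id.diff().ne(0)).cumsum()

print(naive[9] == naive[10])    # True: indexes 9 and 10 share a label
print(robust[9] == robust[10])  # False: the id change splits the run
```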