Pandas Find Sequence or Pattern in Column

2019-02-26 05:19发布

问题:

Here's some example data for the problem I'm working on:

index     Quarter    Sales_Growth
0          2001q1    0
1          2002q2    0
2          2002q3    1
3          2002q4    0
4          2003q1    0
5          2004q2    0
6          2004q3    1
7          2004q4    1

The Sales_Growth column tells me if there was indeed sales growth in the quarter or not. 0 = no growth, 1 = growth.

First, I'm trying to return the first Quarter when there were two consecutive quarters of no sales growth.

With the data above this answer would be 2001q1.

Then, I want to return the 2nd quarter of consecutive sales growth that occurs AFTER the initial two quarters of no growth.

The answer to this question would be 2004q4.

I've searched and searched but the closest answer I can find I can't get to work: https://stackoverflow.com/a/26539166/3225420

Thanks in advance for helping a Pandas newbie, I'm hacking away as best I can but stuck on this one.

回答1:

You're doing subsequence matching. This is a bit strange, but bear with me:

growth = df.Sales_Growth.astype(str).str.cat()

That gives you:

'00100011'

Then:

growth.index('0011')

Gives you 4 (obviously you'd add a constant 3 to get the index of the last row matched by the pattern).

I feel this approach starts off a bit ugly, but the end result is really usable--you can search for any fixed pattern with no additional coding.



回答2:

For Q1:

temp = df.Sales_Growth + df.Sales_Growth.shift(-1)
df[temp == 0].head(1)

For Q2:

df[(df.Sales_Growth == 1) & (df.Sales_Growth.shift(1) == 1) & (df.Sales_Growth.shift(2) == 0) & (df.Sales_Growth.shift(3) == 0)].head(1)


回答3:

Building on the earlier answers. Q1:

temp = df.Sales_Growth.rolling_apply(window=2, min_periods=2, \
    kwargs={pattern: [0,0]}, func=lambda x, pattern: x == pattern)
print(df[temp==1].head())

In the rolling_apply call, window and min_periods must match the length of the pattern list being passed to the rolling_apply function.

Q2: Same approach, different pattern:

temp = df.Sales_Growth.rolling_apply(window=4, min_periods=4, \
    kwargs={pattern: [0,0,1,1]}, func=lambda x, pattern: x == pattern)
print(df[temp==1].head())