Count rows by a certain combination of row values in pandas

Asked 2019-01-27 02:07

I have a dataframe (df) like this:

  v1    v2  v3
   0    -30 -15
   0    -30 -7.5
   0    -30 -11.25
   0    -30 -13.125
   0    -30 -14.0625
   0    -30 -13.59375
   0    -10 -5
   0    -10 -7.5
   0    -10 -6.25
   0    -10 -5.625
   0    -10 -5.9375
   0    -10 -6.09375
   0    -5  -2.5
   0    -5  -1.25
   0    -5  -1.875

Rows belong to the same chunk if they share the same v1 and v2 values; here the chunks are [0, -30], [0, -10], and [0, -5]. I want to split the rows into these chunks and count the rows in each one. If a chunk does not have exactly 6 rows, remove the whole chunk; otherwise keep it.
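For anyone who wants to reproduce this, the sample frame can be rebuilt with:

import pandas as pd

df = pd.DataFrame({
    'v1': [0] * 15,
    'v2': [-30] * 6 + [-10] * 6 + [-5] * 3,
    'v3': [-15, -7.5, -11.25, -13.125, -14.0625, -13.59375,
           -5, -7.5, -6.25, -5.625, -5.9375, -6.09375,
           -2.5, -1.25, -1.875],
})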

My rough code:

import itertools

chunks = []
for i, j in itertools.product(df.v1.unique(), df.v2.unique()):
    # note the parentheses: & binds tighter than ==
    chunk = df[(df['v1'] == i) & (df['v2'] == j)]
    if len(chunk) == 6:
        chunks.append(chunk)

df = pd.concat(chunks)

expected output:

  v1    v2  v3
   0    -30 -15
   0    -30 -7.5
   0    -30 -11.25
   0    -30 -13.125
   0    -30 -14.0625
   0    -30 -13.59375
   0    -10 -5
   0    -10 -7.5
   0    -10 -6.25
   0    -10 -5.625
   0    -10 -5.9375
   0    -10 -6.09375

Thanks!

3 Answers

姐就是有狂的资本 · 2019-01-27 02:30

Use groupby + count/size:

df.groupby(['v1', 'v2']).v3.transform('count')

0     6.0
1     6.0
2     6.0
3     6.0
4     6.0
5     6.0
6     6.0
7     6.0
8     6.0
9     6.0
10    6.0
11    6.0
12    3.0
13    3.0
14    3.0
Name: v3, dtype: float64

Use the mask to filter df:

df = df[df.groupby(['v1', 'v2']).v3.transform('count').eq(6)]    # == 6
df

    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375

count excludes NaNs, while size includes them. Use whichever is appropriate for your data.
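A quick way to see the difference on a toy frame (tmp, g and x are just demo names):

import numpy as np
import pandas as pd

tmp = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1.0, np.nan, 2.0]})

# count ignores the NaN in group 'a', size does not
print(tmp.groupby('g').x.transform('count'))   # 1, 1, 1
print(tmp.groupby('g').x.transform('size'))    # 2, 2, 1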

做个烂人 · 2019-01-27 02:38

You can use the groupby filter method:

In [11]: df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
Out[11]:
    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375
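For context, filter calls the function once per (v1, v2) group with that group's sub-DataFrame, and keeps the group's rows only when the function returns True. The same thing with a named predicate (keep_complete is just an illustrative name):

def keep_complete(group):
    # group is the sub-DataFrame for one (v1, v2) combination
    return len(group) == 6

df.groupby(['v1', 'v2']).filter(keep_complete)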
Deceive 欺骗 · 2019-01-27 02:49

Assuming there are no NaNs in v1 and v2, use transform + size:

df = df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
print (df)
    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375

Detail:

print (df.groupby(['v1', 'v2'])['v2'].transform('size') == 6)
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12    False
13    False
14    False
Name: v2, dtype: bool

Unfortunately filter is really slow, so if you need better performance, use transform:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcdefghijkl')
df = pd.DataFrame({'v1': np.random.choice(L, N),
                   'v2': np.random.randint(10000, size=N),
                   'value': np.random.randint(1000, size=N),
                   'value2': np.random.randint(5000, size=N)})
df = df.sort_values(['v1', 'v2']).reset_index(drop=True)
print(df.head(10))

In [290]: %timeit df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
1 loop, best of 3: 12.1 s per loop

In [291]: %timeit df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
1 loop, best of 3: 176 ms per loop

In [292]: %timeit df[df.groupby(['v1', 'v2']).v2.transform('count').eq(6)]
10 loops, best of 3: 175 ms per loop

N = 1000000
ngroups = 1000
df = pd.DataFrame(dict(A=np.random.randint(0, ngroups, size=N), B=np.random.randn(N)))

In [299]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
1 loop, best of 3: 330 ms per loop

In [300]: %timeit df[df.groupby(['A'])['A'].transform('size') > 1000]
10 loops, best of 3: 101 ms per loop

Caveat

These timings do not vary the number of groups, which affects performance a lot for some of these solutions.
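If you want to check that yourself, one rough sketch is to rerun the comparison over a few group counts (scaled down here so it finishes quickly; the threshold is just the average group size):

import timeit
import numpy as np
import pandas as pd

N = 100000
for ngroups in (10, 100, 1000):
    df = pd.DataFrame({'A': np.random.randint(0, ngroups, size=N),
                       'B': np.random.randn(N)})
    thresh = N // ngroups   # average group size
    t_f = timeit.timeit(lambda: df.groupby('A').filter(lambda x: len(x) > thresh),
                        number=3)
    t_t = timeit.timeit(lambda: df[df.groupby('A')['A'].transform('size') > thresh],
                        number=3)
    print(ngroups, 'groups -> filter:', round(t_f, 3), 's, transform:', round(t_t, 3), 's')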
