Count rows by certain combination of row values in pandas

Published 2019-01-27 01:53

Question:

I have a dataframe (df) like this:

  v1    v2  v3
   0    -30 -15
   0    -30 -7.5
   0    -30 -11.25
   0    -30 -13.125
   0    -30 -14.0625
   0    -30 -13.59375
   0    -10 -5
   0    -10 -7.5
   0    -10 -6.25
   0    -10 -5.625
   0    -10 -5.9375
   0    -10 -6.09375
   0    -5  -2.5
   0    -5  -1.25
   0    -5  -1.875

Rows belong to the same chunk if they share the same v1 and v2 values; here the chunks are ([0, -30], [0, -10], [0, -5]). I want to split the rows into chunks and count the number of rows in each chunk. If a chunk does not have exactly 6 rows, remove the whole chunk; otherwise keep it.

My rough code (fixed up so it at least runs — the original had a broken loop and comparison):

v1_ls = df.v1.unique()
v2_ls = df.v2.unique()
for i in range(len(v1_ls)):
    for j in range(len(v2_ls)):
        chunk = df[(df['v1'] == v1_ls[i]) & (df['v2'] == v2_ls[j])]
        if len(chunk) != 6:
            df = df.drop(chunk.index)

expected output:

  v1    v2  v3
   0    -30 -15
   0    -30 -7.5
   0    -30 -11.25
   0    -30 -13.125
   0    -30 -14.0625
   0    -30 -13.59375
   0    -10 -5
   0    -10 -7.5
   0    -10 -6.25
   0    -10 -5.625
   0    -10 -5.9375
   0    -10 -6.09375

Thanks!

Answer 1:

Assuming there are no NaNs in v1 and v2, use transform with 'size':

df = df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
print (df)
    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375

Detail:

print (df.groupby(['v1', 'v2'])['v2'].transform('size') == 6)
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12    False
13    False
14    False
Name: v2, dtype: bool
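As a detail on why this mask works: transform broadcasts each group's aggregate back to every row of that group, so the result aligns with the original index of df and can be used directly as a boolean filter. A minimal sketch with made-up data, contrasting size as a plain aggregation versus as a transform:

```python
import pandas as pd

df = pd.DataFrame({'v1': [0, 0, 0, 0, 0],
                   'v2': [-30, -30, -30, -10, -10]})

# A plain aggregation returns one value per group...
per_group = df.groupby(['v1', 'v2']).size()
print(per_group)

# ...while transform('size') broadcasts that value back to every
# original row, so it lines up with df's index.
per_row = df.groupby(['v1', 'v2'])['v2'].transform('size')
print(per_row.tolist())  # [3, 3, 3, 2, 2]
```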

Unfortunately filter is really slow, so if you need better performance use transform:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcdefghijkl') 
df = pd.DataFrame({'v1': np.random.choice(L, N),
                   'v2':np.random.randint(10000,size=N),
                   'value':np.random.randint(1000,size=N),
                   'value2':np.random.randint(5000,size=N)})
df = df.sort_values(['v1','v2']).reset_index(drop=True)
print (df.head(10))

In [290]: %timeit df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
1 loop, best of 3: 12.1 s per loop

In [291]: %timeit df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
1 loop, best of 3: 176 ms per loop

In [292]: %timeit df[df.groupby(['v1', 'v2']).v2.transform('count').eq(6)]
10 loops, best of 3: 175 ms per loop

N = 1000000

ngroups = 1000

df = pd.DataFrame(dict(A = np.random.randint(0,ngroups,size=N),B=np.random.randn(N)))

In [299]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
1 loop, best of 3: 330 ms per loop

In [300]: %timeit df[df.groupby(['A'])['A'].transform('size') > 1000]
10 loops, best of 3: 101 ms per loop

Caveat

The results do not address performance given the number of groups, which will affect timings a lot for some of these solutions.
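To illustrate that caveat, a rough sketch (the group count and threshold below are arbitrary): filter invokes its Python lambda once per group, so its cost grows with the number of groups, whereas the transform mask is a single vectorised pass. Both produce the same rows:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
N = 100_000
ngroups = 10_000  # many small groups: the worst case for filter

df = pd.DataFrame({'A': np.random.randint(0, ngroups, size=N),
                   'B': np.random.randn(N)})

# filter calls the lambda once per group (~ngroups Python calls)...
res_filter = df.groupby('A').filter(lambda x: len(x) > 5)

# ...while the transform mask computes all group sizes in one pass.
res_transform = df[df.groupby('A')['A'].transform('size') > 5]

# The two expressions are equivalent; only their cost profiles differ.
assert res_filter.equals(res_transform)
```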



Answer 2:

You can use the groupby filter method:

In [11]: df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
Out[11]:
    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375


Answer 3:

Use groupby + count/size:

df.groupby(['v1', 'v2']).v3.transform('count')

0     6.0
1     6.0
2     6.0
3     6.0
4     6.0
5     6.0
6     6.0
7     6.0
8     6.0
9     6.0
10    6.0
11    6.0
12    3.0
13    3.0
14    3.0
Name: v3, dtype: float64

Use the mask to filter df:

df = df[df.groupby(['v1', 'v2']).v3.transform('count').eq(6)]    # == 6
df

    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375

count does not count NaNs, while size does. Use whatever is appropriate for you.
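A small sketch of that difference, with made-up data containing a NaN in v3:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [0, 0, 0],
                   'v2': [-5, -5, -5],
                   'v3': [-2.5, np.nan, -1.875]})

# size counts every row in the group, NaN or not...
sizes = df.groupby(['v1', 'v2'])['v3'].transform('size')
print(sizes.tolist())   # [3, 3, 3]

# ...while count skips the NaN, so each row gets 2 here.
counts = df.groupby(['v1', 'v2'])['v3'].transform('count')
print(counts.tolist())
```

So if the chunk-length rule should ignore rows with missing v3, use count; if every row matters, use size.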