Count rows by a certain combination of row values in pandas

Asked 2019-01-27 02:07

I have a dataframe (df) like this:

  v1    v2  v3
   0    -30 -15
   0    -30 -7.5
   0    -30 -11.25
   0    -30 -13.125
   0    -30 -14.0625
   0    -30 -13.59375
   0    -10 -5
   0    -10 -7.5
   0    -10 -6.25
   0    -10 -5.625
   0    -10 -5.9375
   0    -10 -6.09375
   0    -5  -2.5
   0    -5  -1.25
   0    -5  -1.875

Rows belong to the same chunk if they share the same v1 and v2 values; here the chunks are [0, -30], [0, -10], and [0, -5]. I want to split the rows into these chunks and count the rows in each one. If a chunk does not have exactly 6 rows, remove the whole chunk; otherwise keep it.
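For anyone who wants to reproduce this, the sample frame can be rebuilt with:

import pandas as pd

df = pd.DataFrame({
    'v1': [0] * 15,
    'v2': [-30] * 6 + [-10] * 6 + [-5] * 3,
    'v3': [-15, -7.5, -11.25, -13.125, -14.0625, -13.59375,
           -5, -7.5, -6.25, -5.625, -5.9375, -6.09375,
           -2.5, -1.25, -1.875],
})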

My rough code:

import itertools

chunks = []
for i, j in itertools.product(df.v1.unique(), df.v2.unique()):
    # note the parentheses: & binds tighter than ==
    chunk = df[(df['v1'] == i) & (df['v2'] == j)]
    if len(chunk) == 6:
        chunks.append(chunk)

df = pd.concat(chunks)

expected output:

  v1    v2  v3
   0    -30 -15
   0    -30 -7.5
   0    -30 -11.25
   0    -30 -13.125
   0    -30 -14.0625
   0    -30 -13.59375
   0    -10 -5
   0    -10 -7.5
   0    -10 -6.25
   0    -10 -5.625
   0    -10 -5.9375
   0    -10 -6.09375

Thanks!

3 Answers

姐就是有狂的资本 · 2019-01-27 02:30

Use groupby + count/size:

df.groupby(['v1', 'v2']).v3.transform('count')

0     6.0
1     6.0
2     6.0
3     6.0
4     6.0
5     6.0
6     6.0
7     6.0
8     6.0
9     6.0
10    6.0
11    6.0
12    3.0
13    3.0
14    3.0
Name: v3, dtype: float64

Use the mask to filter df:

df = df[df.groupby(['v1', 'v2']).v3.transform('count').eq(6)]    # == 6
df

    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375

count excludes NaNs, while size includes them. Use whichever is appropriate for your data.
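A quick way to see the difference on a toy frame (tmp, g and x are just demo names):

import numpy as np
import pandas as pd

tmp = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1.0, np.nan, 2.0]})

# count ignores the NaN in group 'a', size does not
print(tmp.groupby('g').x.transform('count'))   # 1, 1, 1
print(tmp.groupby('g').x.transform('size'))    # 2, 2, 1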

做个烂人 · 2019-01-27 02:38

You can use the groupby filter method:

In [11]: df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
Out[11]:
    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375
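For context, filter calls the function once per (v1, v2) group with that group's sub-DataFrame, and keeps the group's rows only when the function returns True. The same thing with a named predicate (keep_complete is just an illustrative name):

def keep_complete(group):
    # group is the sub-DataFrame for one (v1, v2) combination
    return len(group) == 6

df.groupby(['v1', 'v2']).filter(keep_complete)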
Deceive 欺骗 · 2019-01-27 02:49

Assuming there are no NaNs in v1 and v2, use transform + size:

df = df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
print (df)
    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375

Detail:

print (df.groupby(['v1', 'v2'])['v2'].transform('size') == 6)
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12    False
13    False
14    False
Name: v2, dtype: bool

Unfortunately filter is really slow, so if you need better performance, use transform:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcdefghijkl')
df = pd.DataFrame({'v1': np.random.choice(L, N),
                   'v2': np.random.randint(10000, size=N),
                   'value': np.random.randint(1000, size=N),
                   'value2': np.random.randint(5000, size=N)})
df = df.sort_values(['v1', 'v2']).reset_index(drop=True)
print(df.head(10))

In [290]: %timeit df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
1 loop, best of 3: 12.1 s per loop

In [291]: %timeit df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
1 loop, best of 3: 176 ms per loop

In [292]: %timeit df[df.groupby(['v1', 'v2']).v2.transform('count').eq(6)]
10 loops, best of 3: 175 ms per loop

N = 1000000
ngroups = 1000
df = pd.DataFrame(dict(A=np.random.randint(0, ngroups, size=N), B=np.random.randn(N)))

In [299]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
1 loop, best of 3: 330 ms per loop

In [300]: %timeit df[df.groupby(['A'])['A'].transform('size') > 1000]
10 loops, best of 3: 101 ms per loop

Caveat

These timings do not vary the number of groups, which affects performance a lot for some of these solutions.
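If you want to check that yourself, one rough sketch is to rerun the comparison over a few group counts (scaled down here so it finishes quickly; the threshold is just the average group size):

import timeit
import numpy as np
import pandas as pd

N = 100000
for ngroups in (10, 100, 1000):
    df = pd.DataFrame({'A': np.random.randint(0, ngroups, size=N),
                       'B': np.random.randn(N)})
    thresh = N // ngroups   # average group size
    t_f = timeit.timeit(lambda: df.groupby('A').filter(lambda x: len(x) > thresh),
                        number=3)
    t_t = timeit.timeit(lambda: df[df.groupby('A')['A'].transform('size') > thresh],
                        number=3)
    print(ngroups, 'groups -> filter:', round(t_f, 3), 's, transform:', round(t_t, 3), 's')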
