Slicing Pandas Dataframe according to number of li

I suppose this is something rather simple, but I can't find how to make this. I've been searching tutorials and stackoverflow.

Suppose I have a dataframe df loking like this :

Group   Id_In_Group   SomeQuantity
1        1              10
1        2              20
2        1               7
3        1              16
3        2              22
3        3               5
3        4              12
3        5              28
4        1               1
4        2              18
4        3              14
4        4               7
5        1              36

I would like to select only the lines having at least 4 objects in the group (so there are at least 4 rows having the same "group" number) and for which SomeQuantity for the 4th object, when sorted in the group by ascending SomeQuantity, is greater than 20 (for example).

In the given Dataframe, for example, it would only return the 3rd group, since it has 4 (>=4) members and its 4th SomeQuantity (after sorting) is 22 (>=20), so it should construct the dataframe :

Group   Id_In_Group   SomeQuantity
3        1              16
3        2              22
3        3               5
3        4              12
3        5              28

(being or not sorted by SomeQuantity, whatever).

Could somebody be kind enough to help me? :)

标签： python pandas dataframe slice

3条回答

ら.Afraid

2楼-- · 2019-06-28 06:43

This is how I have worked through your question, warts and all. Im sure there are much nicer ways to do this.

Find groups with "4 objects in the group"

import collections

groups = list({k for k, v in collections.Counter(df.Group).items() if v > 3} );groups

Out:[3, 4]

Use these groups to filter to a new df containing these groups:

df2 = df[df.Group.isin(groups)]

"4th SomeQuantity (after sorting) is 22 (>=20)"

 df3 = df2.sort_values(by='SomeQuantity',ascending=False)

(Updated as per comment below...)

df3.groupby('Group').filter(lambda grp: any(grp.sort_values('SomeQuantity').iloc[3] >= 20)).sort_index()

    Group   Id_In_Group SomeQuantity
  3    3        1       16
  4    3        2       22
  5    3        3       5
  6    3        4       12
  7    3        5       28

0人赞添加讨论(0) 举报

爷、活的狠高调

3楼-- · 2019-06-28 06:45

I would use .groupby() + .filter() methods:

In [66]: df.groupby('Group').filter(lambda x: len(x) >= 4 and x['SomeQuantity'].max() >= 20)
Out[66]:
   Group  Id_In_Group  SomeQuantity
3      3            1            16
4      3            2            22
5      3            3             5
6      3            4            12
7      3            5            28

0人赞添加讨论(0) 举报

Root（大扎）

4楼-- · 2019-06-28 06:56

A slightly different approach using map, value_counts, groupby , filter:

(df[df.Group.map(df.Group.value_counts().ge(4))]
   .groupby('Group')
   .filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))

Breakdown of steps:

Perform value_counts to compute the total counts of distinct elements present in Group column.

>>> df.Group.value_counts()

3    5
4    4
1    2
5    1
2    1
Name: Group, dtype: int64

Use map which functions like a dictionary (wherein the index becomes the keys and the series elements become the values) to map these results back to the original DF

>>> df.Group.map(df.Group.value_counts())

0     2
1     2
2     1
3     5
4     5
5     5
6     5
7     5
8     4
9     4
10    4
11    4
12    1
Name: Group, dtype: int64

Then, we check for the elements having a value of 4 or more which is our threshold limit and take only those subset from the entire DF.

>>> df[df.Group.map(df.Group.value_counts().ge(4))]   

    Group  Id_In_Group  SomeQuantity
3       3            1            16
4       3            2            22
5       3            3             5
6       3            4            12
7       3            5            28
8       4            1             1
9       4            2            28
10      4            3            14
11      4            4             7

Inorder to use groupby.filter operation on this, we must make sure that we return a single boolean value corresponding to each grouped key when we perform the sorting process and compare the fourth element to the threshold which is 20. np.any returns all such possiblities matching our filter.

>>> df[df.Group.map(df.Group.value_counts().ge(4))]         \
      .groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])

 Group
3    22
4    18
dtype: int64

From these, we compare the fourth element .iloc[3] as it is 0-based indexed and return all such favourable matches.

0人赞添加讨论(0) 举报

Slicing Pandas Dataframe according to number of li

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间