Slicing Pandas Dataframe according to number of li

2019-06-28 06:27发布

I suppose this is something rather simple, but I can't find how to make this. I've been searching tutorials and stackoverflow.

Suppose I have a dataframe df loking like this :

Group   Id_In_Group   SomeQuantity
1        1              10
1        2              20
2        1               7
3        1              16
3        2              22
3        3               5
3        4              12
3        5              28
4        1               1
4        2              18
4        3              14
4        4               7
5        1              36

I would like to select only the lines having at least 4 objects in the group (so there are at least 4 rows having the same "group" number) and for which SomeQuantity for the 4th object, when sorted in the group by ascending SomeQuantity, is greater than 20 (for example).

In the given Dataframe, for example, it would only return the 3rd group, since it has 4 (>=4) members and its 4th SomeQuantity (after sorting) is 22 (>=20), so it should construct the dataframe :

Group   Id_In_Group   SomeQuantity
3        1              16
3        2              22
3        3               5
3        4              12
3        5              28

(being or not sorted by SomeQuantity, whatever).

Could somebody be kind enough to help me? :)

3条回答
ら.Afraid
2楼-- · 2019-06-28 06:43

This is how I have worked through your question, warts and all. Im sure there are much nicer ways to do this.

Find groups with "4 objects in the group"

import collections

groups = list({k for k, v in collections.Counter(df.Group).items() if v > 3} );groups

Out:[3, 4]

Use these groups to filter to a new df containing these groups:

df2 = df[df.Group.isin(groups)]

"4th SomeQuantity (after sorting) is 22 (>=20)"

 df3 = df2.sort_values(by='SomeQuantity',ascending=False)

(Updated as per comment below...)

df3.groupby('Group').filter(lambda grp: any(grp.sort_values('SomeQuantity').iloc[3] >= 20)).sort_index()

    Group   Id_In_Group SomeQuantity
  3    3        1       16
  4    3        2       22
  5    3        3       5
  6    3        4       12
  7    3        5       28
查看更多
爷、活的狠高调
3楼-- · 2019-06-28 06:45

I would use .groupby() + .filter() methods:

In [66]: df.groupby('Group').filter(lambda x: len(x) >= 4 and x['SomeQuantity'].max() >= 20)
Out[66]:
   Group  Id_In_Group  SomeQuantity
3      3            1            16
4      3            2            22
5      3            3             5
6      3            4            12
7      3            5            28
查看更多
Root(大扎)
4楼-- · 2019-06-28 06:56

A slightly different approach using map, value_counts, groupby , filter:

(df[df.Group.map(df.Group.value_counts().ge(4))]
   .groupby('Group')
   .filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))

enter image description here


Breakdown of steps:

Perform value_counts to compute the total counts of distinct elements present in Group column.

>>> df.Group.value_counts()

3    5
4    4
1    2
5    1
2    1
Name: Group, dtype: int64

Use map which functions like a dictionary (wherein the index becomes the keys and the series elements become the values) to map these results back to the original DF

>>> df.Group.map(df.Group.value_counts())

0     2
1     2
2     1
3     5
4     5
5     5
6     5
7     5
8     4
9     4
10    4
11    4
12    1
Name: Group, dtype: int64

Then, we check for the elements having a value of 4 or more which is our threshold limit and take only those subset from the entire DF.

>>> df[df.Group.map(df.Group.value_counts().ge(4))]   

    Group  Id_In_Group  SomeQuantity
3       3            1            16
4       3            2            22
5       3            3             5
6       3            4            12
7       3            5            28
8       4            1             1
9       4            2            28
10      4            3            14
11      4            4             7

Inorder to use groupby.filter operation on this, we must make sure that we return a single boolean value corresponding to each grouped key when we perform the sorting process and compare the fourth element to the threshold which is 20. np.any returns all such possiblities matching our filter.

>>> df[df.Group.map(df.Group.value_counts().ge(4))]         \
      .groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])

 Group
3    22
4    18
dtype: int64

From these, we compare the fourth element .iloc[3] as it is 0-based indexed and return all such favourable matches.

查看更多
登录 后发表回答