Pandas: group by equal range

2019-07-14 06:36发布

问题:

This is an example of my data frame:

df_lst = [
  {"wordcount": 100, "Stats": 198765, "id": 34},
     {"wordcount": 99, "Stats": 98765, "id": 35},
     {"wordcount": 200, "Stats": 18765, "id": 36},
     {"wordcount": 250, "Stats": 788765, "id": 37},
     {"wordcount": 345, "Stats": 12765, "id": 38},
     {"wordcount": 456, "Stats": 238765, "id": 39},
     {"wordcount": 478, "Stats": 1934, "id": 40},
     {"wordcount": 890, "Stats": 19845, "id": 41},
     {"wordcount": 812, "Stats": 1987, "id": 42}]
df = pd.DataFrame(df_lst)
df.set_index('id', inplace=True)
df.head()

DF:

    Stats   wordcount
id      
34  198765  100
35  98765   99
36  18765   200
37  788765  250
38  12765   345

I'd like to calculate the average Stats for each range of wordcount with a step of 100 so the new data frame looks something like this:

    Average wordcount
    194567  100
    23456   200
    2378    300
    ...

Where 100 means from 0-100 etc. I started writing multiple conditions but have a feeling there is a more efficient way of achieving this. Will appreciate your help.

回答1:

use pd.cut() method:

In [92]: bins = np.arange(0, df['wordcount'].max().round(-2) + 100, 100)

In [94]: df.groupby(pd.cut(df['wordcount'], bins=bins, labels=bins[1:]))['Stats'].mean()
Out[94]:
wordcount
100    148765.0
200     18765.0
300    788765.0
400     12765.0
500    120349.5
600         NaN
700         NaN
800         NaN
900     10916.0
Name: Stats, dtype: float64


回答2:

import math
def roundup(x):
    return int(math.ceil(x / 100.0)) * 100
df['roundup']=df.wordcount.apply(roundup)
df.groupby('roundup').Stats.mean()
Out[824]: 
roundup
100    148765.0
200     18765.0
300    788765.0
400     12765.0
500    120349.5
900     10916.0
Name: Stats, dtype: float64