Splitting groupby() in pandas into smaller groups

2019-07-27 07:07发布

问题:

            city  temperature  windspeed   event
            day                                                 
            2017-01-01  new york           32          6    Rain
            2017-01-02  new york           36          7   Sunny
            2017-01-03  new york           28         12    Snow
            2017-01-04  new york           33          7   Sunny
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-07  new york           27         12    Rain
            2017-01-08  new york           23          7  Rain
            2017-01-01    mumbai           90          5   Sunny
            2017-01-02    mumbai           85         12     Fog
            2017-01-03    mumbai           87         15     Fog
            2017-01-04    mumbai           92          5    Rain
            2017-01-05    mumbai           89          7   Sunny
            2017-01-06    mumbai           80         10     Fog
            2017-01-07    mumbai           85         9     Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-01     paris           45         20   Sunny
            2017-01-02     paris           50         13  Cloudy
            2017-01-03     paris           54          8  Cloudy
            2017-01-04     paris           42         10  Cloudy
            2017-01-05     paris           43         20   Sunny
            2017-01-06     paris           48         4  Cloudy
            2017-01-07     paris           40          14  Rain
            2017-01-08     paris           42         15  Cloudy
            2017-01-09     paris           53         8  Sunny

The above shows the original data.

Below shows the result using np.array_split(data, 4).

            day city  temperature  windspeed  event                                                
            2017-01-01  new york           32          6    Rain
            2017-01-02  new york           36          7   Sunny
            2017-01-03  new york           28         12    Snow
            2017-01-04  new york           33          7   Sunny
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-07  new york           27         12    Rain  

            day city  temperature  windspeed  event                                                    
            2017-01-08  new york           23          7  Rain
            2017-01-01    mumbai           90          5   Sunny
            2017-01-02    mumbai           85         12     Fog
            2017-01-03    mumbai           87         15     Fog
            2017-01-04    mumbai           92          5    Rain
            2017-01-05    mumbai           89          7   Sunny             
            day city  temperature  windspeed  event                                                  
            2017-01-06    mumbai           80         10     Fog
            2017-01-07    mumbai           85         9     Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-01     paris           45         20   Sunny
            2017-01-02     paris           50         13  Cloudy
            2017-01-03     paris           54          8  Cloudy              
            day city  temperature  windspeed  event             
            2017-01-04     paris           42         10  Cloudy
            2017-01-05     paris           43         20   Sunny
            2017-01-06     paris           48         4  Cloudy
            2017-01-07     paris           40          14  Rain
            2017-01-08     paris           42         15  Cloudy
            2017-01-09     paris           53         8  Sunny

As you can see here, I'm trying to create 4 groups from the original data, making sure that each group has all the cities. however, by using array.split(), it split the data into 4 groups but it does not contain all the cities. I want each group to have Mumbai, Paris and New York. How can I do that?

Meaning to say, what I'm trying to achieve is something like below:

Group 1:

            day city  temperature  windspeed  event                                                
            2017-01-01  new york           32          6   Rain
            2017-01-02  paris           50         13  Cloudy
            2017-01-02    mumbai           85         12    Fog, 
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-05    mumbai           89          7   Sunny  
            2017-01-05     paris           43         20   Sunny

Group 2:

            day city  temperature  windspeed  event                                                    
            2017-01-04  new york           33          7  Sunny
            2017-01-01    mumbai           90          5  Sunny
            2017-01-03  paris           54          8  Cloudy
            2017-01-07  new york           27         12    Rain 
            2017-01-06    mumbai           80         10     Fog
            2017-01-09     paris           53         8  Sunny

Group 3:

            day city  temperature  windspeed  event         
            2017-01-02  new york           36          7  Sunny                                         
            2017-01-03  mumbai           87         15    Fog
            2017-01-01   paris           45         20  Sunny,   
            2017-01-08    mumbai           89          8    Rain
            2017-01-06     paris           48         4  Cloudy
            2017-01-07     paris           40          14  Rain

Group 4:

            day city  temperature  windspeed  event             
            2017-01-03  new york           28         12   Snow,  
            2017-01-04  mumbai           92          5   Rain
            2017-01-07    mumbai           85         9     Sunny
            2017-01-04  paris           42         10  Cloudy
            2017-01-08     paris           42         15  Cloudy
            2017-01-08  new york           23          7  Rain

As you can see from the expected result, the main thing is that all the groups contain each topic.

What I have in mind is to group the data by city, then from each city's dataframe, divide the data into 4 groups, then for each group in the city, combine the data to get 4 final group.

回答1:

You can create a helper column via GroupBy + cumcount to count the occurrence of each city.

Then use dict + tuple with another GroupBy to create a dictionary of dataframes, each one containing exactly one occurence of each city.

# add index column giving count of city occurrence
df['index'] = df.groupby('city').cumcount()

# create dictionary of dataframes
d = dict(tuple(df.groupby('index')))

Result:

print(d)

{0:                city  temperature  windspeed  event  index
 day                                                      
 2017-01-01  newyork           32          6   Rain      0
 2017-01-01   mumbai           90          5  Sunny      0
 2017-01-01    paris           45         20  Sunny      0,
 1:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-02  newyork           36          7   Sunny      1
 2017-01-02   mumbai           85         12     Fog      1
 2017-01-02    paris           50         13  Cloudy      1,
 2:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-03  newyork           28         12    Snow      2
 2017-01-03   mumbai           87         15     Fog      2
 2017-01-03    paris           54          8  Cloudy      2,
 3:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-04  newyork           33          7   Sunny      3
 2017-01-04   mumbai           92          5    Rain      3
 2017-01-04    paris           42         10  Cloudy      3}

You can then extract individual "groups" via d[0], d[1], d[2], d[3]. In this particular case, you may wish to group by dates instead, i.e.

d = {df_.index[0]: df_ for _, df_ in df.groupby('index')}


回答2:

This is my approach to this. First sort your dataframe by day and city:

df = df.sort_values(by=['day', 'city'])

Next find an even split of 4 groups for your dataframe - if the split is not even then the last group will get the remaining:

n = int(len(df)/4)
groups_n = np.cumsum([0, n, n, n, len(df)-(3*n)])
print(groups_n)
OUT >> array([ 0,  6, 12, 18, 25], dtype=int32)

groups_n is the start and end for each group. So Group 1 I will take df.iloc[0:6] and Group 4 I will take df.iloc[18:25].

So your final dictionary, d, of the 4 group split of your dataframe will be:

d = {}
for i in range(4):
    d[i+1] = df.iloc[groups_n[i]:groups_n[i+1]]

Example Outputs:Group 1 (d[1])

            city      temperature  windspeed    event
day             
2017-01-01  mumbai    90           5            Sunny
2017-01-01  new york  32           6            Rain
2017-01-01  paris     45           20           Sunny
2017-01-02  mumbai    85           12           Fog
2017-01-02  new york  36           7            Sunny
2017-01-02  paris     50           13           Cloudy

Group 4: (d[4])

            city       temperature  windspeed   event
day             
2017-01-07  mumbai     85           9           Sunny
2017-01-07  new york   27           12          Rain
2017-01-07  paris      40           14          Rain
2017-01-08  mumbai     89           8           Rain
2017-01-08  new york   23           7           Rain
2017-01-08  paris      42           15          Cloudy
2017-01-09  paris      53           8           Sunny