Grouping Pandas dataframe across rows

2019-07-12 02:57发布

问题:

I've a csv like this:

client1,client2,client3,client4,client5,client6,amount
,,,Comp1,,,4.475000
,,,Comp2,,,16.305584
,,,Comp3,,,4.050000
Comp2,Comp1,,Comp4,,,21.000000
,,,Comp4,,,30.000000
,Comp1,,Comp2,,,5.137500
,,,Comp3,,,52.650000
,,,Comp1,,,2.650000
Comp3,,,Comp3,,,29.000000
Comp5,,,Comp2,,,20.809000
Comp5,,,Comp2,,,15.100000
Comp5,,,Comp2,,,52.404000

After reading it into a pandas dataframe, df, I wanted to aggregate in two steps:

Step1:

First, I sum the amount:

client1 client2 client3 client4 client5 client6  amount
                        Comp1                    7.125000
                        Comp2                    16.305584
                        Comp3                    56.700000
                        Comp4                    30.000000
         Comp1          Comp2                    5.137500
Comp2    Comp1          Comp4                    21.000000
Comp3                   Comp3                    29.000000
Comp5                   Comp2                    88.313000  

Then, I want to aggregate by each client name such that if multiple clients are involved like in group 5, then 5.1375 must be split equally between Comp1 and Comp2. Tried it this way:

df.groupby(['client1','client2','client3','client4','client5','client6']).apply(lambda x: x['amount'].sum()/len(x) if x.any().nunique()>=1 else x['amount'].sum())



client1 client2 client3 client4 client5 client6 0
0                           Comp1                   3.562500
1                           Comp2                   16.305584
2                           Comp3                   28.350000
3                           Comp4                   30.000000
4           Comp1           Comp2                   5.137500
5   Comp2   Comp1           Comp4                   21.000000
6   Comp3                   Comp3                   29.000000
7   Comp5                   Comp2                   29.437667

Expected Output is:

Client Amount 
Comp1  4.475+21/3+5.1375/2+2.65 = 16.69375
Comp2  16.305584+21/3+20.809/2+15.10/2+52.404/2 = 67.462084
Comp3  4.05+52.65+29 = 85.7
Comp4  21/3+30 = 37
Comp5  20.809/2+15.10/2+52.404/2 = 44.1565

I tried using sum(axis=0) but of no use.

回答1:

We can use a bit a maths here

cols = ['amount'] 
# Divide the amount by non null fields 
df['new'] = df['amount']/df.drop(cols,1).notnull().sum(1)

#Set the index as new by droping amount column, unstack and drop the nans.
x = df.drop(cols,1).set_index('new').unstack().dropna()

#Create dataframe just from amount and the clients
ndf = pd.DataFrame({'amount':x.index.droplevel(0).values,'clients':x.values})

#Groupby client and get the sum 
ndf.groupby('clients').sum()

Output:

          amount
clients           
Comp1    16.360417
Comp2    69.697501
Comp3    85.700000
Comp4    36.666667
Comp5    44.156500


回答2:

I'd organize it like this:

d = df.drop('amount', 1)  # new df without `amount`
a = df.amount             # separate series of `amount`
c = d.count(1)            # count of non-null values

a.div(c).repeat(c).groupby(d.stack().values).sum()

Comp1    16.693750
Comp2    70.030834
Comp3    85.700000
Comp4    37.000000
Comp5    44.156500
dtype: float64