Another pandas question.
Reading Wes McKinney's excellent book about data analysis and pandas, I encountered the following thing that I thought should work:
Suppose I have some info about tips.
In [119]:
tips.head()
Out[119]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female False Sun Dinner 2 0.059447
1 10.34 1.66 Male False Sun Dinner 3 0.160542
2 21.01 3.50 Male False Sun Dinner 3 0.166587
3 23.68 3.31 Male False Sun Dinner 2 0.139780
4 24.59 3.61 Female False Sun Dinner 4 0.146808
I want to know the five largest tips relative to the total bill (that is, tip_pct) for smokers and non-smokers separately. This works:
def top(df, n=5, column='tip_pct'):
    return df.sort_index(by=column)[-n:]
In [101]:
tips.groupby('smoker').apply(top)
Out[101]:
total_bill tip sex smoker day time size tip_pct
smoker
False 88 24.71 5.85 Male False Thur Lunch 2 0.236746
185 20.69 5.00 Male False Sun Dinner 5 0.241663
51 10.29 2.60 Female False Sun Dinner 2 0.252672
149 7.51 2.00 Male False Thur Lunch 2 0.266312
232 11.61 3.39 Male False Sat Dinner 2 0.291990
True 109 14.31 4.00 Female True Sat Dinner 2 0.279525
183 23.17 6.50 Male True Sun Dinner 4 0.280535
67 3.07 1.00 Female True Sat Dinner 1 0.325733
178 9.60 4.00 Female True Sun Dinner 2 0.416667
172 7.25 5.15 Male True Sun Dinner 2 0.710345
Good enough, but then I wanted to use pandas' transform to do the same, like this:
def top_all(df):
    return df.sort_index(by='tip_pct')
tips.groupby('smoker').transform(top_all)
but instead I get this:
TypeError: Transform function invalid for data types
Why? I know that transform requires the function to return an array with the same dimensions as its input, so I thought I'd be complying with that requirement by just sorting both slices (smokers and non-smokers) of the original DataFrame without changing their respective dimensions. Can anyone explain why it failed?
transform is not that well documented, but it seems that the transform function is passed not the entire group as a DataFrame, but a single column of a single group. I don't think it's really meant for what you're trying to do, and your solution with apply is fine.

So suppose you call tips.groupby('smoker').transform(func). There will be two groups; call them group1 and group2. The transform does not call func(group1) and func(group2). Instead, it calls func(group1['total_bill']), then func(group1['tip']), etc., and then func(group2['total_bill']), func(group2['tip']), and so on.

For example, take a data frame with columns A, B, and C, grouped by C, and a function foo passed to transform: foo is first called with just the A column of the C=1 group of the original data frame, then the B column of that group, then the A column of the C=2 group, etc.

This makes sense if you think about what transform is for. It's meant for applying a transformation within each group. But in general, such functions won't make sense when applied to the entire group, only to a given column. For instance, the example in the pandas docs is about z-standardizing using transform. If you have a DataFrame with columns for age and weight, it wouldn't make sense to z-standardize with respect to the overall mean of both these variables. It doesn't even mean anything to take the overall mean of a bunch of numbers, some of which are ages and some of which are weights. You have to z-standardize the age with respect to the mean age and the weight with respect to the mean weight, which means you want to transform separately for each column.

So basically, you don't need to use transform here. apply is the appropriate function, because apply really does operate on each group as a single DataFrame, while transform operates on each column of each group.
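The per-column calls described above can be observed with a small sketch. The frame below is made up for illustration (columns A and B, grouped by C), and foo is a hypothetical z-standardizing function that also records what it is handed. One caveat: recent pandas versions may additionally try the function on the whole group as a fast path, so the recorded calls can include DataFrames as well as single columns.

```python
import pandas as pd

# Made-up data for illustration: two value columns and a grouping column.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, 20.0, 30.0, 40.0],
    "C": [1, 1, 2, 2],
})

calls = []  # records what transform passes to foo

def foo(x):
    # Note whether we received a single column (Series) or something else.
    if isinstance(x, pd.Series):
        calls.append(("Series", x.name))
    else:
        calls.append((type(x).__name__, list(x.columns)))
    # z-standardize: this is meaningful per column, within one group.
    return (x - x.mean()) / x.std()

result = df.groupby("C").transform(foo)

print(calls)   # inspect which pieces of each group foo was called with
print(result)  # same row index as df, with columns A and B standardized per group
```

Because foo works on one column at a time, it never sees tip_pct next to the other columns, which is why sorting the rows of a whole group belongs in apply and not in transform.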