I have the following dataset:
location  category  percent
A         5         100.0
B         3         100.0
C         2          50.0
          4          13.0
D         2          75.0
          3          59.0
          4          13.0
          5           4.0
I'm trying to get the n largest values of percent for each location group. For example, if I want the top 2 largest percentages per group, the output should be:
location  category  percent
A         5         100.0
B         3         100.0
C         2          50.0
          4          13.0
D         2          75.0
          3          59.0
In pandas this looks relatively straightforward using pandas.core.groupby.SeriesGroupBy.nlargest, but dask doesn't have an nlargest function for groupby. I've been playing around with apply but can't seem to get it to work properly:
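For reference, this is the pandas version that does what I want on the sample data above (SeriesGroupBy.nlargest returns the top values per group, keyed by group plus the original row index):

```python
import pandas as pd

# sample data from above
df = pd.DataFrame({
    "location": ["A", "B", "C", "C", "D", "D", "D", "D"],
    "category": [5, 3, 2, 4, 2, 3, 4, 5],
    "percent": [100.0, 100.0, 50.0, 13.0, 75.0, 59.0, 13.0, 4.0],
})

# top 2 percentages within each location group
top2 = df.groupby("location")["percent"].nlargest(2)
print(top2)
# MultiIndex (location, original row index) -> percent:
# A: 100.0; B: 100.0; C: 50.0, 13.0; D: 75.0, 59.0
```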
df.groupby(['location']).apply(lambda x: x['percent'].nlargest(2)).compute()
But this just raises the error ValueError: Wrong number of items passed 0, placement implies 8.