On a concrete problem, say I have a DataFrame DF
  word tag  count
0    a   S     30
1  the   S     20
2    a   T     60
3   an   T      5
4  the   T     10
I want to find, for every "word", the "tag" that has the most "count". So the return would be something like
  word tag  count
1  the   S     20
2    a   T     60
3   an   T      5
I don't care about the count column or if the order/Index is original or messed up. Returning a dictionary {'the' : 'S', ...} is just fine.
I hope I can do
DF.groupby(['word']).agg(lambda x: x['tag'][ x['count'].argmax() ] )
but it doesn't work. I can't access column information.
More abstractly, what does the function in agg(function) see as its argument?
btw, is .agg() the same as .aggregate() ?
Many thanks.
Here's a simple way to figure out what is being passed (unutbu's solution then applies!).
Your function just operates (in this case) on a sub-section of the frame in which the grouped variable (in this case 'word') has the same value throughout. If you pass a function, you have to deal with the aggregation of potentially non-string columns yourself; standard functions like 'sum' do this for you.
Automatically does NOT aggregate on the string columns
You ARE aggregating on all columns
You can do pretty much anything within the function
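One way to check this for yourself is to record the type of whatever the callable receives. The sketch below is only an illustration: the helper names record_agg and record_apply are made up here, the frame is rebuilt from the question, and the explicit column selection before apply is an assumption made so the example behaves the same across pandas versions.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "word": ["a", "the", "a", "an", "the"],
    "tag": ["S", "S", "T", "T", "T"],
    "count": [30, 20, 60, 5, 10],
})

seen_by_agg = []
seen_by_apply = []

def record_agg(x):
    # Hypothetical helper: remember what kind of object agg hands over.
    seen_by_agg.append(type(x).__name__)
    return x.iloc[0]  # agg needs one value back per group

def record_apply(x):
    # Hypothetical helper: remember what kind of object apply hands over.
    seen_by_apply.append(type(x).__name__)
    return len(x)

df.groupby("word").agg(record_agg)
# Selecting the columns explicitly is an assumption made here to keep
# the behaviour stable across pandas versions.
df.groupby("word")[["tag", "count"]].apply(record_apply)

print(set(seen_by_agg))    # what agg passed to the function
print(set(seen_by_apply))  # what apply passed to the function
```

With a single grouping key and no extra arguments, agg should report Series and apply should report DataFrame, which is why the lambda in the question cannot index by column name.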
agg is the same as aggregate. Its callable is passed the columns (Series objects) of the DataFrame, one at a time.
You could use idxmax to collect the index labels of the rows with the maximum count, and then use loc to select those rows in the word and tag columns.
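A minimal sketch of the idxmax and loc approach, assuming the frame from the question (rebuilt here as df) with its default integer index:

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "word": ["a", "the", "a", "an", "the"],
    "tag": ["S", "S", "T", "T", "T"],
    "count": [30, 20, 60, 5, 10],
})

# For each word, the index label of the row with the largest count.
idx = df.groupby("word")["count"].idxmax()

# Select those rows, keeping only the word and tag columns.
result = df.loc[idx, ["word", "tag"]]
print(result)
```

Here idx is a Series indexed by word, and result holds one row per word.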
Note that idxmax returns index labels. df.loc can be used to select rows by label. But if the index is not unique, that is, if there are rows with duplicate index labels, then df.loc will select all rows with the labels listed in idx (the result of idxmax). So be careful that df.index.is_unique is True if you use idxmax with df.loc.
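To make the caveat concrete, here is a small sketch with a deliberately duplicated index label. The frame dup and its index are made up for illustration, and resetting the index is just one possible fix.

```python
import pandas as pd

# Same data as the question, but with index label 0 used twice.
dup = pd.DataFrame(
    {
        "word": ["a", "the", "a", "an", "the"],
        "tag": ["S", "S", "T", "T", "T"],
        "count": [30, 20, 60, 5, 10],
    },
    index=[0, 1, 0, 2, 3],
)
print(dup.index.is_unique)  # False

idx = dup.groupby("word")["count"].idxmax()
too_many = dup.loc[idx, ["word", "tag"]]
# Label 0 matches two rows, so both rows for "a" come back.
print(len(too_many))

# One possible fix: give the frame a fresh unique index first.
clean = dup.reset_index(drop=True)
idx_clean = clean.groupby("word")["count"].idxmax()
selected = clean.loc[idx_clean, ["word", "tag"]]
print(len(selected))
```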
Alternatively, you could use apply. apply's callable is passed a sub-DataFrame which gives you access to all the columns.
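A sketch of the apply variant, assuming the frame from the question. The function name tag_with_max_count is made up here, and selecting the tag and count columns before apply is an assumption so that the example runs the same way on older and newer pandas versions.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "word": ["a", "the", "a", "an", "the"],
    "tag": ["S", "S", "T", "T", "T"],
    "count": [30, 20, 60, 5, 10],
})

def tag_with_max_count(group):
    # group is a sub-DataFrame holding the rows for one word.
    return group.loc[group["count"].idxmax(), "tag"]

tags = df.groupby("word")[["tag", "count"]].apply(tag_with_max_count)
print(tags)
```

The result is a Series indexed by word whose values are the winning tags.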
Using idxmax and loc is typically faster than apply, especially for large DataFrames (as measured with IPython's %timeit).
If you want a dictionary mapping words to tags, then you could use set_index and to_dict.
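For the dictionary, something along these lines should work, again assuming the frame from the question; the name word_to_tag is just illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "word": ["a", "the", "a", "an", "the"],
    "tag": ["S", "S", "T", "T", "T"],
    "count": [30, 20, 60, 5, 10],
})

# Rows with the largest count per word, as in the idxmax/loc approach.
idx = df.groupby("word")["count"].idxmax()

# Index those rows by word, then turn the tag column into a dict.
word_to_tag = df.loc[idx].set_index("word")["tag"].to_dict()
print(word_to_tag)
```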