I have a data frame `df` and I use several columns from it to `groupby`:
df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()
In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the means, but I would also like to know how many values were used to compute them. For example, the first group has 8 values, the second 10, and so on.
On a `groupby` object, the `agg` function can take a list to apply several aggregation methods at once. This should give you the result you need. We can easily do it by using `groupby` and `count`, but we should remember to use `reset_index()`:
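A minimal sketch of that `agg` call, using an invented stand-in for the question's frame (the column names mirror the question; the data is made up):

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B'],
    'col2': ['x', 'x', 'y', 'y'],
    'col3': [1.0, 3.0, 5.0, 7.0],
    'col4': [2.0, 4.0, 6.0, 8.0],
})

# agg can take a list of methods, so mean and count are computed in one pass.
result = df.groupby(['col1', 'col2']).agg(['mean', 'count'])

# reset_index() turns the group keys back into ordinary columns.
flat = df.groupby(['col1', 'col2']).agg(['mean', 'count']).reset_index()
```

The result has one `count` column next to each `mean` column, which answers the question directly.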
Quick Answer:

The simplest way to get row counts per group is by calling `.size()`, which returns a `Series`. Usually you want this result as a `DataFrame` (instead of a `Series`), so you can do `.size().reset_index(name='counts')`. If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.
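A sketch of both forms, on a made-up frame (`col1`/`col2` mirror the question; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B'],
    'col2': ['x', 'x', 'y', 'y'],
})

# .size() counts rows per group and returns a Series
# indexed by the group keys.
counts = df.groupby(['col1', 'col2']).size()

# reset_index(name='counts') converts that Series to a DataFrame
# with the group keys as columns and the counts in a named column.
counts_df = df.groupby(['col1', 'col2']).size().reset_index(name='counts')
```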
Detailed example:
Consider the following example dataframe:
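The answer's original frame is not reproduced in this copy; as a stand-in, assume something like the following (the names and values are illustrative):

```python
import pandas as pd

# Hypothetical example dataframe: col1/col2 are group keys,
# col3/col4 hold the numeric values we aggregate.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'col2': ['x', 'x', 'y', 'y', 'y', 'x'],
    'col3': [0.1, 0.5, 0.2, 0.3, 0.8, 0.4],
    'col4': [1, 2, 3, 4, 5, 6],
})
```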
First let's use `.size()` to get the row counts, then `.size().reset_index(name='counts')` to get the row counts as a `DataFrame` with a named `counts` column:

Including results for more statistics
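Before adding more statistics, here is a sketch of the two row-count calls just described (the frame is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'col2': ['x', 'x', 'y', 'y', 'y', 'x'],
})

# Series of row counts, indexed by the group keys.
sizes = df.groupby(['col1', 'col2']).size()

# Same counts as a flat DataFrame with a 'counts' column.
counts = df.groupby(['col1', 'col2']).size().reset_index(name='counts')
```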
When you want to calculate statistics on grouped data, it usually looks like this:
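A sketch of that pattern, with a list of aggregations passed to `agg` (invented data; the exact statistics in the original answer may differ):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B'],
    'col2': ['x', 'x', 'y', 'y'],
    'col3': [1.0, 3.0, 5.0, 7.0],
    'col4': [2.0, 4.0, 6.0, 8.0],
})

# Several statistics at once: the result has nested (column, statistic)
# labels, and 'count' appears once per value column.
stats = df.groupby(['col1', 'col2']).agg(['mean', 'count', 'min', 'max'])
```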
The result above is a little annoying to deal with because of the nested column labels, and also because the row counts are on a per-column basis.
To gain more control over the output, I usually split the statistics into individual aggregations that I then combine using `join`. It looks like this:

Footnotes
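A sketch of that split-and-join pattern (the `col3_mean`/`col3_max` names and the data are my own illustration, not the original answer's code):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B', 'B'],
    'col2': ['x', 'x', 'y', 'y', 'y'],
    'col3': [1.0, 3.0, 5.0, 7.0, 9.0],
})

gb = df.groupby(['col1', 'col2'])

# One flat column per statistic, plus an independent row count,
# all combined on the shared group index with join.
counts = gb.size().to_frame(name='counts')
summary = (counts
           .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
           .join(gb.agg({'col3': 'max'}).rename(columns={'col3': 'col3_max'}))
           .reset_index())
```

Each aggregation stays a single-level column, so the final frame has plain labels and exactly one `counts` column.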
The code used to generate the test data is shown below:
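The original snippet is not preserved in this copy; the following is a plausible stand-in that generates a frame of the same shape (seed, sizes, and value ranges are my own choices):

```python
import numpy as np
import pandas as pd

# Stand-in test-data generator: random group keys in col1/col2,
# random numeric values in col3/col4.
np.random.seed(0)
n = 20
df = pd.DataFrame({
    'col1': np.random.choice(['A', 'B', 'C'], size=n),
    'col2': np.random.choice(['x', 'y'], size=n),
    'col3': np.random.rand(n),
    'col4': np.random.randint(0, 10, size=n),
})
```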
Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean, because pandas will drop `NaN` entries in the mean calculation without telling you about it.
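A sketch of the pitfall: with a `NaN` in the data, the group size and the per-column count disagree, and `mean` quietly uses the smaller number (the frame is invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A'],
    'col3': [1.0, 3.0, np.nan],
})

g = df.groupby('col1')

size = g.size()['A']            # 3 rows in the group
count = g['col3'].count()['A']  # only 2 non-null values
mean = g['col3'].mean()['A']    # (1 + 3) / 2 = 2.0; NaN silently dropped
```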