If I have a dataframe that has columns that include the same name, is there a way to combine the columns that have the same name with some sort of function (i.e. sum)?
For instance with:
In [186]:
df["NY-WEB01"].head()
Out[186]:
NY-WEB01 NY-WEB01
DateTime
2012-10-18 16:00:00 5.6 2.8
2012-10-18 17:00:00 18.6 12.0
2012-10-18 18:00:00 18.4 12.0
2012-10-18 19:00:00 18.2 12.0
2012-10-18 20:00:00 19.2 12.0
How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?
You don't need a lambda here, nor do you explicitly have to query
df.columns
;groupby
accepts alevel
argument you can specify in conjunction with theaxis
argument. This is cleaner, IMO.Handling
MultiIndex
esAnother case to consider is when dealing with
MultiIndex
columns. ConsiderTo perform aggregation across the upper levels, use
or, if aggregating per upper level only, use
Dropping Duplicate Columns
If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use
Index.duplicated
:Or, to keep the last ones, specify
keep='last'
(default is'first'
),The
groupby
alternatives for the two solutions above aredf.groupby(level=0, axis=1).first()
, and... .last()
, respectively.I believe this does what you are after:
Alternatively, between 3% and 15% faster depending on the length of the df:
EDIT: To extend this beyond sums, use
.agg()
(short for.aggregate()
):