I'm loading a CSV file which has the following columns:
date, textA, textB, numberA, numberB
I want to group by the columns date, textA and textB, but apply "sum" to numberA and "min" to numberB.
import pandas as pd
data = pd.read_table("file.csv", sep=",", thousands=',')
grouped = data.groupby(["date", "textA", "textB"], as_index=False)
...but I cannot see how to then apply two different aggregate functions to two different columns, i.e. sum(numberA) and min(numberB).
The agg method can accept a dict, in which case the keys indicate the column to which each function is applied:
grouped.agg({'numberA':'sum', 'numberB':'min'})
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'number A': np.arange(8),
                   'number B': np.arange(8) * 2})

# Group on 'A', summing 'number A' and taking the minimum of 'number B'
grouped = df.groupby('A')
print(grouped.agg({'number A': 'sum',
                   'number B': 'min'}))
yields
     number A  number B
A
bar         9         2
foo        19         0
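Applied to the columns from the question, the same pattern would look something like the sketch below; "file.csv" and the thousands=',' option are taken from the question itself, so the exact contents of the file are assumed.

import pandas as pd

# Sketch only: mirrors the read_table call from the question
data = pd.read_table("file.csv", sep=",", thousands=',')

# The group keys become ordinary columns because of as_index=False
result = data.groupby(["date", "textA", "textB"], as_index=False).agg(
    {"numberA": "sum", "numberB": "min"})
print(result)

If you are on pandas 0.25 or later, the same result can also be written with named aggregation, e.g. .agg(numberA=('numberA', 'sum'), numberB=('numberB', 'min')), which additionally lets you rename the output columns.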
This also shows that Pandas can handle spaces in column names. I'm not sure what the origin of the problem was, but literal spaces should not have posed a problem. If you wish to investigate this further,
print(df.columns)
without reassigning the column names, will show us the repr of the names. Maybe there was a hard-to-see character in the column name that looked like a space (or some other character) but was actually a u'\xa0' (NO-BREAK SPACE), for example.
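If you suspect something like that, here is a small sketch (the DataFrame is purely illustrative, not the asker's data) of how to surface and normalise such hidden characters:

import pandas as pd

# Hypothetical frame whose first column name hides a no-break space
df = pd.DataFrame({'number\xa0A': [1, 2], 'number B': [3, 4]})

print([repr(name) for name in df.columns])   # reveals 'number\xa0A'

# Replace no-break spaces with ordinary spaces and trim the names
df.columns = df.columns.str.replace('\xa0', ' ', regex=False).str.strip()
print(df.columns.tolist())                   # ['number A', 'number B']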