I have a dataframe with about 100 columns that looks like
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \
0 56 1 1 0 1 0 0
1 11 0 0 0 0 1 0
2 6 0 0 1 0 0 1
3 43 0 0 0 1 0 1
4 14 0 1 0 0 1 0
Histo Economics-51 Literature-re Literatureu4
0 1 0 1 0
1 0 0 0 1
2 0 0 0 0
3 0 1 1 0
4 1 0 0 0
My goal is to keep only the more global categories: just Economics, English, History, Literature, and to write into the dataframe the sum of the values of each category's components; for instance, for English: English-107 + English-2
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
For those purposes I tried these two methods.
first method:
df=pd.read_csv(file_path, sep='\t')
df['History']=df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
second method:
df=pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History']=0 #initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History']=df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]
but both give me the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
Could you suggest how I can debug this error or, maybe, another solution for my problem? Please note that I have a large dataframe with about 100 columns and 400,000 rows, so I'm looking for a really optimized solution, like one using loc in pandas.
I'd suggest that you do something different: transpose, group the rows (your original columns) by their prefix, sum, and transpose again.
After the transpose, the part of each row label before the separator is the category; grouping by that prefix, summing, and transposing back does what you want.
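A minimal sketch of that transpose/groupby/sum/transpose approach (the sample frame is illustrative, and Id is assumed to be kept out of the aggregation):

```python
import pandas as pd

# Small frame mirroring the question's structure (Id column omitted).
df = pd.DataFrame({
    'English-107': [1, 0],
    'English-2':   [0, 0],
    'History-3':   [1, 0],
})

# Transpose, group the rows by the part of the label before '-',
# sum within each group, then transpose back.
result = df.T.groupby(lambda c: c.split('-')[0]).sum().T
print(result)
#    English  History
# 0        1        1
# 1        0        0
```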
In your case, make sure to split using the '-' character.
Using DSM's brilliant idea, you can group the columns directly by their prefix instead of spelling out the split in a lambda.
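That column-wise grouping probably looked something along these lines (a sketch rather than the original answer's exact code; the sample frame is hypothetical, and Id is carried through unchanged):

```python
import pandas as pd

df = pd.DataFrame({
    'Id':          [56, 11],
    'English-107': [1, 0],
    'English-2':   [1, 1],
    'History-3':   [0, 1],
})

# Derive each column's category from the text before the first '-',
# keeping Id out of the aggregation.
prefixes = df.columns.drop('Id').str.split('-').str[0]
summed = df.drop(columns='Id').T.groupby(prefixes).sum().T
result = pd.concat([df[['Id']], summed], axis=1)
print(result)
#    Id  English  History
# 0  56        2        0
# 1  11        1        1
```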
Here is another version, which takes care of the "Histo"/"History" problem.
P.S. You may want to add any missing categories to the categories map/dictionary.
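A sketch of such a version, using a hypothetical categories dictionary that maps raw prefixes (including the truncated Histo) to canonical names:

```python
import pandas as pd

df = pd.DataFrame({
    'History-3':     [0, 1],
    'Histo':         [1, 0],
    'Literature-re': [1, 0],
    'Literatureu4':  [0, 1],
})

# Hypothetical map from raw prefixes to canonical categories;
# extend it with any categories missing from your data.
categories = {
    'History':      'History',
    'Histo':        'History',
    'Literature':   'Literature',
    'Literatureu4': 'Literature',
}

def canonical(col):
    # Strip a '-suffix' if present, then look the prefix up in the map;
    # unknown prefixes fall through unchanged.
    prefix = col.split('-')[0]
    return categories.get(prefix, prefix)

result = df.T.groupby(canonical).sum().T
print(result)
#    History  Literature
# 0        1          1
# 1        1          1
```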