I have a dataframe with about 100 columns that looks like this:
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \
0 56 1 1 0 1 0 0
1 11 0 0 0 0 1 0
2 6 0 0 1 0 0 1
3 43 0 0 0 1 0 1
4 14 0 1 0 0 1 0
Histo Economics-51 Literature-re Literatureu4
0 1 0 1 0
1 0 0 0 1
2 0 0 0 0
3 0 1 1 0
4 1 0 0 0
My goal is to keep only the broader categories (Economics, English, History, Literature) and to write into the dataframe the sum of each category's component columns; for instance, English = English-107 + English-2:
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
To achieve this I tried the following two methods.
First method:
df=pd.read_csv(file_path, sep='\t')
df['History']=df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df=pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History']=0 #initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History']=df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]
but both give me the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
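For reference, the second method does run on a toy frame once `axes=1` becomes `axis=1` and the print selects columns with a list instead of a tuple; a minimal sketch with made-up data (the column names and values here are invented, not my real frame):

```python
import pandas as pd

# Made-up sample standing in for the real 100-column frame
df = pd.DataFrame({
    "Id": [56, 11, 6],
    "History-3": [1, 0, 0],
    "Histo": [1, 0, 0],
    "English-2": [0, 0, 1],
})

# Same filtering idea as the second method above; 'Histo' (not 'History')
# so the truncated Histo column is caught too, matching History=2 in row 0
filter_col = [col for col in df.columns if col.startswith('Histo')]

# Row-wise sum needs axis=1 (axes=1 is an unknown keyword or silently
# ignored, depending on the pandas version)
df['History'] = df[filter_col].sum(axis=1)

# Indexing with a tuple like df['History', df[filter_col]] is what raises
# "'DataFrame' objects are mutable, thus they cannot be hashed";
# select multiple columns with a list instead:
print(df[['Id', 'History']])
```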
Could you propose how I can debug this error or, maybe, another solution to my problem? Please notice that I have a large dataframe with about 100 columns and 400,000 rows, so I'm looking for a really optimized solution, e.g. with loc in pandas.
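For what it's worth, here is a minimal sketch of the kind of fully vectorized, whole-frame result I have in mind, collapsing every category at once rather than one prefix at a time (the leading-alphabetic-prefix rule and the toy data are assumptions, not my real frame):

```python
import pandas as pd

# Toy frame mirroring the question's shape (values are made up)
df = pd.DataFrame({
    "Id": [56, 11],
    "English-107": [1, 0],
    "English-2": [0, 1],
    "History-3": [1, 0],
    "Literature-re": [1, 0],
})

# Take the leading alphabetic run of each column name as its category,
# e.g. "English-107" -> "English", "Id" -> "Id"
cats = df.columns.str.extract(r"^([A-Za-z]+)", expand=False)

# Transpose, group the former columns by category, sum, transpose back;
# no Python-level loop over rows, so it should scale to ~100 x 400,000
out = df.T.groupby(cats.values).sum().T
print(out)
```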