I have a very large dataset where I want to replace strings with numbers. I would like to operate on the dataset without typing a mapping function for each key (column) in the dataset (similar to the fillna method, but replacing a specific string with an associated value).
Is there any way to do this?
Here is an example of my dataset:
data
   resp          A          B          C
0     1       poor       poor       good
1     2       good       poor       good
2     3  very good  very good  very good
3     4        bad       poor        bad
4     5   very bad   very bad   very bad
5     6       poor       good   very bad
6     7       good       good       good
7     8  very good  very good  very good
8     9        bad        bad   very bad
9    10   very bad   very bad   very bad
The desired result:
data
resp A B C
0 1 3 3 4
1 2 4 3 4
2 3 5 5 5
3 4 2 3 2
4 5 1 1 1
5 6 3 4 1
6 7 4 4 4
7 8 5 5 5
8 9 2 2 1
9 10 1 1 1
very bad=1, bad=2, poor=3, good=4, very good=5
Use replace
In [126]: df.replace(['very bad', 'bad', 'poor', 'good', 'very good'],
                     [1, 2, 3, 4, 5])
Out[126]:
resp A B C
0 1 3 3 4
1 2 4 3 4
2 3 5 5 5
3 4 2 3 2
4 5 1 1 1
5 6 3 4 1
6 7 4 4 4
7 8 5 5 5
8 9 2 2 1
9 10 1 1 1
Considering data is your pandas DataFrame, you can also use:
data.replace({'very bad': 1, 'bad': 2, 'poor': 3, 'good': 4, 'very good': 5}, inplace=True)
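For a self-contained run, here is a minimal sketch; the frame below just reconstructs the sample data from the question, and both the copying and the in-place forms are shown:
import pandas as pd

# Reconstruction of the sample data from the question
df = pd.DataFrame({
    'resp': range(1, 11),
    'A': ['poor', 'good', 'very good', 'bad', 'very bad',
          'poor', 'good', 'very good', 'bad', 'very bad'],
    'B': ['poor', 'poor', 'very good', 'poor', 'very bad',
          'good', 'good', 'very good', 'bad', 'very bad'],
    'C': ['good', 'good', 'very good', 'bad', 'very bad',
          'very bad', 'good', 'very good', 'very bad', 'very bad'],
})

mapping = {'very bad': 1, 'bad': 2, 'poor': 3, 'good': 4, 'very good': 5}

result = df.replace(mapping)       # returns a new DataFrame
df.replace(mapping, inplace=True)  # modifies df itself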
v0.23+ Answer
Includes some performant alternatives, including the use of pd.Categorical.
DataFrame.replace
If the replacement is to be performed on a subset of columns, then you should not call replace on the entire DataFrame. For example, you can slice with DataFrame.loc and call replace with a dict mapping.
df
   resp          A          B          C
0     1       poor       poor       good
1     2       good       poor       good
2     3  very good  very good  very good
3     4        bad       poor        bad
4     5   very bad   very bad   very bad
mapping = {'very bad': 1, 'bad': 2, 'poor': 3, 'good': 4, 'very good': 5}
df.loc[:,'A':'C'] = df.loc[:,'A':'C'].replace(mapping)
df
resp A B C
0 1 3 3 4
1 2 4 3 4
2 3 5 5 5
3 4 2 3 2
4 5 1 1 1
If you need to perform regex replacement, add the regex=True flag:
df.loc[:,'A':'C'].replace({'bad': 'x', 'good': 'y'}, regex=True)
        A       B       C
0    poor    poor       y
1       y    poor       y
2  very y  very y  very y
3       x    poor       x
4  very x  very x  very x
If you don't want to modify the original, use DataFrame.join to get "resp" back into a copy:
result = df[['resp']].join(df.loc[:,'A':'C'].replace(mapping))
result
resp A B C
0 1 3 3 4
1 2 4 3 4
2 3 5 5 5
3 4 2 3 2
4 5 1 1 1
Series.map with DataFrame.apply
You can also apply a Series.map-based replacement on each column. map is generally faster than replace on a Series, but your mileage may vary.
df.loc[:,'A':'C'].apply(lambda x: x.map(mapping))
   A  B  C
0  3  3  4
1  4  3  4
2  5  5  5
3  2  3  2
4  1  1  1
Note that values missing from the mapping are replaced with NaN:
df.loc[:,'A':'C'].apply(lambda x: x.map({'poor': '3'}))
A B C
0 3 3 NaN
1 NaN 3 NaN
2 NaN NaN NaN
3 NaN 3 NaN
4 NaN NaN NaN
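If you would rather keep the original value wherever the mapping has no entry (closer to the fillna-style behaviour mentioned in the question), one possible sketch is to fill the NaNs back from the source column; note that the result then holds a mix of mapped numbers and original strings:
# map() leaves NaN where the key is missing; fillna(x) restores the original value
df.loc[:,'A':'C'].apply(lambda x: x.map({'poor': 3}).fillna(x))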
pd.Categorical
This is another useful option and quite fast. You can create categories with a specified ordering, and let pandas assign the codes for you.
labels = ['very bad', 'bad', 'poor', 'good', 'very good']
df.loc[:,'A':'C'].apply(
    lambda x: pd.Categorical(x, categories=labels, ordered=True).codes + 1)
A B C
0 3 3 4
1 4 3 4
2 5 5 5
3 2 3 2
4 1 1 1
df[['resp']].join(df.loc[:,'A':'C'].apply(
    lambda x: pd.Categorical(x, categories=labels, ordered=True).codes + 1)
)
resp A B C
0 1 3 3 4
1 2 4 3 4
2 3 5 5 5
3 4 2 3 2
4 5 1 1 1
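One caveat with this approach (an observation about pd.Categorical, not something shown above): a value that is not in categories gets a code of -1, so after the + 1 it silently becomes 0 rather than NaN. A small check, using a made-up label 'excellent':
s = pd.Series(['poor', 'excellent'])  # 'excellent' is not one of the labels
pd.Categorical(s, categories=labels, ordered=True).codes + 1
# array([3, 0], dtype=int8) -- the unknown label ends up as 0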
Performance
In these tests, we will not concern ourselves with assigning "resp" back to the result; the aim is to time only the replacement itself.
The functions have been timed and plotted using the perfplot module.
Benchmarking Code
import perfplot
import pandas as pd

data = {
    'A': ['poor', 'good', 'very good', 'bad', 'very bad'],
    'B': ['poor', 'poor', 'very good', 'poor', 'very bad'],
    'C': ['good', 'good', 'very good', 'bad', 'very bad']
}
df = pd.DataFrame(data)

# mapping and labels as defined earlier in this answer
mapping = {'very bad': 1, 'bad': 2, 'poor': 3, 'good': 4, 'very good': 5}
labels = ['very bad', 'bad', 'poor', 'good', 'very good']

perfplot.show(
    setup=lambda n: pd.concat([df] * n, ignore_index=True),
    kernels=[
        lambda df: df.replace(mapping),
        lambda df: df.apply(lambda x: x.map(mapping)),
        lambda df: df.apply(
            lambda x: pd.Categorical(x, categories=labels, ordered=True).codes + 1)
    ],
    labels=['replace', 'apply + map', 'apply + pd.Categorical'],
    n_range=[2**k for k in range(0, 10)],
    xlabel='N',
    logy=True
)
Note that this only tests scaling in the number of rows (depth-wise), not in the number of columns (breadth-wise). Depending on the shape of your data, your mileage may vary, so the best thing to do is to test each method on your own DataFrame and use what works best.
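If you want a quick way to run that comparison yourself, here is a rough sketch using the standard timeit module on a breadth-wise frame; the wide frame, repeat factor, and run count are arbitrary assumptions, and df, mapping and labels are as defined above:
import timeit
import pandas as pd

# Hypothetical wide frame: repeat the three rating columns side by side
wide_df = pd.concat([df.loc[:, 'A':'C']] * 50, axis=1, ignore_index=True)

candidates = {
    'replace': lambda: wide_df.replace(mapping),
    'apply + map': lambda: wide_df.apply(lambda x: x.map(mapping)),
    'apply + pd.Categorical': lambda: wide_df.apply(
        lambda x: pd.Categorical(x, categories=labels, ordered=True).codes + 1),
}

for name, func in candidates.items():
    print(name, timeit.timeit(func, number=10))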