Replace string/value in entire DataFrame

2019-03-08 10:30发布

问题:

I have a very large dataset were I want to replace strings with numbers. I would like to operate on the dataset without typing a mapping function for each key (column) in the dataset. (similar to the fillna method, but replace specific string with assosiated value). Is there anyway to do this?

Here is an example of my dataset

data
   resp          A          B          C
0     1       poor       poor       good
1     2       good       poor       good
2     3  very good  very good  very good
3     4       bad        poor       bad 
4     5   very bad   very bad   very bad
5     6       poor       good   very bad
6     7       good       good       good
7     8  very good  very good  very good
8     9       bad        bad    very bad
9    10   very bad   very bad   very bad

The desired result:

 data
   resp  A  B  C
0      1  3  3  4
1     2  4  3  4
2     3  5  5  5
3     4  2  3  2
4     5  1  1  1
5     6  3  4  1
6     7  4  4  4
7     8  5  5  5
8     9  2  2  1
9    10  1  1  1

very bad=1, bad=2, poor=3, good=4, very good=5

//Jonas

回答1:

Use replace

In [126]: df.replace(['very bad', 'bad', 'poor', 'good', 'very good'], 
                     [1, 2, 3, 4, 5]) 
Out[126]: 
      resp  A  B  C
   0     1  3  3  4
   1     2  4  3  4
   2     3  5  5  5
   3     4  2  3  2
   4     5  1  1  1
   5     6  3  4  1
   6     7  4  4  4
   7     8  5  5  5
   8     9  2  2  1
   9    10  1  1  1


回答2:

Considering data is your pandas DataFrame you can also use:

data.replace({'very bad': 1, 'bad': 2, 'poor': 3, 'good': 4, 'very good': 5}, inplace=True)


回答3:

v0.23+ Answer, ...
Includes some performant alternatives, including the use of pd.Categorical.


DataFrame.replace
If the replacement is to be performed on a subset of columns, then you should not call replace on the entire DataFrame. For example, you can slice with DataFrame.loc and call replace with a dict mapping.

df    

   resp          A          B          C
0     1       poor       poor       good
1     2       good       poor       good
2     3  very good  very good  very good
3     4        bad       poor        bad
4     5   very bad   very bad   very bad

mapping = {'very bad': 1, 'bad': 2, 'poor': 3, 'good': 4, 'very good': 5}

df.loc[:,'A':'C'] = df.loc[:,'A':'C'].replace(mapping)
df

   resp  A  B  C
0     1  3  3  4
1     2  4  3  4
2     3  5  5  5
3     4  2  3  2
4     5  1  1  1

If you need to perform regex replacement, add the regex=True flag:

df.loc[:,'A':'C'].replace({'bad': 'x', 'good': 'y'}, regex=True)

        A       B       C
0    poor    poor       y
1       y    poor       y
2  very y  very y  very y
3       x    poor       x
4  very x  very x  very x

If you don't want to modify the original, use DataFrame.join to get "resp" back into a copy:

result = df[['resp']].join(df.loc[:,'A':'C'].replace(mapping))
result

   resp  A  B  C
0     1  3  3  4
1     2  4  3  4
2     3  5  5  5
3     4  2  3  2
4     5  1  1  1

Series.map with DataFrame.apply
You can also apply a Series.map-based replacement on each column. map is generally faster then replace in Series, but your mileage may vary.

df.loc[:,'A':'C'].apply(lambda x: x.map(mapping))

   resp  A  B  C
0     1  3  3  4
1     2  4  3  4
2     3  5  5  5
3     4  2  3  2
4     5  1  1  1

Note that missing mappings from the Series are replaced with NaNs.

df.loc[:,'A':'C'].apply(lambda x: x.map({'poor': '3'}))

     A    B    C
0    3    3  NaN
1  NaN    3  NaN
2  NaN  NaN  NaN
3  NaN    3  NaN
4  NaN  NaN  NaN

pd.Categorical
This is another useful option and quite fast. You can create categories with a specified ordering, and let pandas assign the codes for you.

labels = ['very bad', 'bad', 'poor', 'good', 'very good']
df.loc[:,'A':'C'].apply(
    lambda x: pd.Categorical(x, categories=labels, ordered=True).codes + 1)

   A  B  C
0  3  3  4
1  4  3  4
2  5  5  5
3  2  3  2
4  1  1  1

df[['resp']].join(df.loc[:,'A':'C'].apply(
     lambda x: pd.Categorical(x, categories=labels, ordered=True).codes + 1)
)

   resp  A  B  C
0     1  3  3  4
1     2  4  3  4
2     3  5  5  5
3     4  2  3  2
4     5  1  1  1

Performance
In our tests, we will not concern ourselves with the task of assigning "resp" back to the result. This will aim to time only the replace.

The functions have been timed and plotted using the perfplot module.

Benchmarking Code

import perfplot 
import pandas as pd

data = {
 'A': ['poor', 'good', 'very good', 'bad' , 'very bad'],
 'B': ['poor', 'poor', 'very good', 'poor', 'very bad'],
 'C': ['good', 'good', 'very good', 'bad' , 'very bad']
} 

df = pd.DataFrame(data)

perfplot.show(
    setup=lambda n: pd.concat([df] * n, ignore_index=True),
    kernels=[
        lambda df: df.replace(mapping),
        lambda df: df.apply(lambda x: x.map(mapping)),
        lambda df: df.apply(
            lambda x: pd.Categorical(x, categories=labels, ordered=True).codes + 1)
    ],
    labels=['replace', 'apply + map', 'apply + pd.Categorical'],
    n_range=[2**k for k in range(0, 10)],
    xlabel='N',
    logy=True
)

This only tests the functions on DataFrames that increase depth-wise, not breadth-wise. Depending on the number of rows and columns, your mileage may vary, so the best thing to do would be to test each method on your DataFrame and use what works the best.