replace substring in pandas data frame column

2019-05-23 16:03发布

I am working with dataframe that contains column named "raw_parameter_name". In this column i have different string values. Several values are like following pattern "ABCD;MEAN". What i am trying to do is to replace each value "ABCD;MEAN" with "ABCD;X-BAR". Sub string "ABCD" may vary but pattern ";MEAN" is constant i want to replace. Looked into different options using "replace" method but don't know how to replace sub string only and not whole string. Please advise. Thank you in advance

3条回答
Anthone
2楼-- · 2019-05-23 16:27

You can use regex module re for example:

import pandas as pd
import re

df = pd.DataFrame({"row_parameter_name":['abcd;MEAN','Dogg11;MEAN',';MEAN']})

Out[126]:
  row_parameter_name
0          abcd;MEAN
1        Dogg11;MEAN
2              ;MEAN 

df['row_parameter_name'] = df['row_parameter_name'].apply(lambda x: re.sub(';MEAN$',';X-BAR',x))

In [128]: df
Out[128]:
  row_parameter_name
0         abcd;X-BAR
1       Dogg11;X-BAR
2             ;X-BAR
查看更多
一纸荒年 Trace。
3楼-- · 2019-05-23 16:36

use str.contains to create a boolean index to mask the series and then str.replace to replace your substring:

In [172]:
df = pd.DataFrame({'raw_parameter_name':['ABCD;MEAN', 'EFGH;MEAN', '1234;MEAN', 'sdasd;MEAT']})
df

Out[172]:
  raw_parameter_name
0          ABCD;MEAN
1          EFGH;MEAN
2          1234;MEAN
3         sdasd;MEAT

In [173]:
df.loc[df['raw_parameter_name'].str.contains(';MEAN$'), 'raw_parameter_name'] = df['raw_parameter_name'].str.replace('MEAN', 'X-BAR')
df

Out[173]:
  raw_parameter_name
0           ABCD;X-BAR
1           EFGH;X-BAR
2           1234;X-BAR
3         sdasd;MEAT

Here it matches where the substrin ';MEAN' exists the $ is a terminating symbol

The boolean mask looks like the following:

In [176]:
df['raw_parameter_name'].str.contains(';MEAN$')

Out[176]:
0     True
1     True
2     True
3    False
Name: raw_parameter_name, dtype: bool

Timings

For a 40,0000 row df using str.replace is faster than using apply:

In [183]:
import re
%timeit df['raw_parameter_name'].apply(lambda x: re.sub(';MEAN$',';X-BAR',x))
%timeit df['raw_parameter_name'].str.replace('MEAN', 'X-BAR')
​
1 loops, best of 3: 1.01 s per loop
1 loops, best of 3: 687 ms per loop
查看更多
放荡不羁爱自由
4楼-- · 2019-05-23 16:39

You do not have to use re like in the example that was marked correct above. It may have been necessary at one point in time, but this is not the best answer to this anymore.

Nor do you need to use str.contains() first.

Instead just use .str.replace() with the appropriate match and replacement.

In [2]: df = pd.DataFrame({"row_parameter_name":['abcd;MEAN','Nothing;NICE','Dogg11;MEAN',';MEAN','MEANY-MEANY;MEAN']})

In [3]: df
Out[3]: row_parameter_name
        0   abcd;MEAN
        1   Nothing;NICE
        2   Dogg11;MEAN
        3   ;MEAN
        4   MEANY-MEANY;MEAN

In [4]: df.row_parameter_name.str.replace("MEAN$","X-BAR")
Out[4]: 0           abcd;X-BAR
        1         Nothing;NICE
        2         Dogg11;X-BAR
        3               ;X-BAR
        4    MEANY-MEANY;X-BAR
        Name: row_parameter_name, dtype: object
查看更多
登录 后发表回答