Replace any string in columns with 1

Posted 2019-07-22 00:42

Question:

I'm working with pandas. My goal is to convert several columns within a dataframe that contain either NaN or string data into, more or less, dummy variables (0's for NaN; 1's for any string). I'd like to do this without using a complete list of strings and replacing each of them, because there are typos and this would lead to errors. I've been able to replace all the NaN data with 0's using the fillna function, which works like a dream!

I am hoping for something similar that will replace all string data with 1's but leave the 0's in place. I've searched Stack Overflow and elsewhere, to little avail.

The data look roughly like this; I only want this to apply to the columns starting with T_:

    fol    T_opp    T_Dir    T_Enh   Activity
    1      0        0        vo      hf
    2      vr       0        0       hx
    2      0        0        0       fe
    3      0        bt       0       rn

I'd like the output to look the same, but with "vr", "bt", and "vo" each replaced with the integer 1. From what I can tell, the pandas get_dummies function is not what I'm looking for. I also can't make this work with replace(). I tried something using a T/F mask and a list of zeros, but the outcome was so wrong I won't bother to post the code here.

Edited: I've added an additional column to the toy data above. The 'Activity' column is some data, also strings, that I do not want to touch.
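
For reference, a minimal sketch that rebuilds the toy data above as a DataFrame (this assumes the zeros are plain '0' strings left over from the fillna(0) step; numeric zeros would work just as well with the answers below):

    import pandas as pd

    # Toy frame from the question: the T_ columns hold either the
    # string '0' or an arbitrary (possibly typo-ridden) string.
    df = pd.DataFrame({
        'fol':      [1, 2, 2, 3],
        'T_opp':    ['0', 'vr', '0', '0'],
        'T_Dir':    ['0', '0', '0', 'bt'],
        'T_Enh':    ['vo', '0', '0', '0'],
        'Activity': ['hf', 'hx', 'fe', 'rn'],
    })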

Answer 1:

Another option is to do this the other way around: first convert to numeric:

In [11]: df.convert_objects(convert_numeric=True)
Out[11]: 
   fol  T_opp  T_Dir  T_Enh Activity
0    1      0      0    NaN       hf
1    2    NaN      0      0       hx
2    2      0      0      0       fe
3    3      0    NaN      0       rn

And then fill in the NaNs with 1:

In [12]: df.convert_objects(convert_numeric=True).fillna(1)
Out[12]: 
   fol  T_opp  T_Dir  T_Enh Activity
0    1      0      0      1       hf
1    2      1      0      0       hx
2    2      0      0      0       fe
3    3      0      1      0       rn
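
Note that convert_objects() has since been deprecated and removed from pandas, so on a current install the same idea is probably best sketched with pd.to_numeric, restricted to the T_ columns so that a string column like Activity is left untouched (a rough equivalent, not the original author's code):

    import pandas as pd

    # Coerce the T_ columns to numbers (strings become NaN),
    # then fill the NaNs with 1 and cast to int.
    t_cols = df.filter(regex='^T_').columns
    df[t_cols] = (df[t_cols]
                  .apply(pd.to_numeric, errors='coerce')
                  .fillna(1)
                  .astype(int))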


Answer 2:

You can do this with DataFrame.replace() with a regular expression:

In [14]: df
Out[14]:
   fol T_opp T_Dir T_Enh
0    1     0     0    vo
1    2    vr     0     0
2    2     0     0     0
3    3     0    bt     0

In [15]: df.replace(regex={'vr|bt|vo': '1'}).convert_objects(convert_numeric=True)
Out[15]:
   fol T_opp T_Dir T_Enh
0    1     0     0     1
1    2     1     0     0
2    2     0     0     0
3    3     0     1     0

If for some reason you're against dicts, you can be very explicit about it too:

In [19]: df.replace(regex='vr|bt|vo', value='1')
Out[19]:
   fol T_opp T_Dir T_Enh
0    1     0     0     1
1    2     1     0     0
2    2     0     0     0
3    3     0     1     0

But wait, there's more! You can specify the columns you want to operate on by passing a nested dict (the column keys cannot be regular expressions; well, they can, but that won't do anything except return the frame unchanged):

In [22]: df.replace({'T_opp': {'vr': 1}, 'T_Dir': {'bt': 1}})
Out[22]:
   fol T_opp T_Dir T_Enh
0    1     0     0    vo
1    2     1     0     0
2    2     0     0     0
3    3     0     1     0
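
If there are many T_ columns, the nested dict itself can be built programmatically; a small sketch (the inner mapping still lists the exact strings, just like In [22] above):

    # Hypothetical helper: map every column starting with 'T_' to the
    # same exact-string replacement dict, then replace in one call.
    strings_to_one = {'vr': 1, 'bt': 1, 'vo': 1}
    mapping = {col: strings_to_one for col in df.columns if col.startswith('T_')}
    df.replace(mapping)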

EDIT: Since you want to replace all strings with the number 1 (as per your comments below), do:

In [23]: df.replace(regex={r'\D+': 1})
Out[23]:
   fol T_opp T_Dir T_Enh
0    1     0     0     1
1    2     1     0     0
2    2     0     0     0
3    3     0     1     0

EDIT: Microbenchmarks might be useful here:

Andy's method (faster):

In [11]: timeit df.convert_objects(convert_numeric=True).fillna(1)
1000 loops, best of 3: 590 µs per loop

DataFrame.replace():

In [46]: timeit df.replace(regex={r'\D': 1})
1000 loops, best of 3: 801 µs per loop

If you have columns containing strings that you want to keep:

In [45]: cols_to_replace = 'T_opp', 'T_Dir', 'T_Enh'

In [46]: d = dict(zip(cols_to_replace, [{r'\D': 1}] * len(cols_to_replace)))

In [47]: d
Out[47]: {'T_Dir': {'\\D': 1}, 'T_Enh': {'\\D': 1}, 'T_opp': {'\\D': 1}}

In [48]: df.replace(d)
Out[48]:
   fol T_opp T_Dir T_Enh Activity
0    1     0     0     1       hf
1    2     1     0     0       hx
2    2     0     0     0       fe
3    3     0     1     0       rn
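
Depending on your pandas version, a bare df.replace(d) may treat the '\D' keys as literal strings rather than regular expressions; if that happens, passing the same nested dict through the regex keyword makes the intent explicit (a hedged sketch, building the dict with a comprehension instead of zip):

    # Per-column regex replacement, passed via the regex keyword so the
    # r'\D' keys are interpreted as patterns.
    cols_to_replace = ['T_opp', 'T_Dir', 'T_Enh']
    d = {col: {r'\D': 1} for col in cols_to_replace}
    df.replace(regex=d)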

Yet another way is to use filter and join the results together after replacement:

In [10]: df
Out[10]:
   fol T_opp T_Dir T_Enh Activity
0    1     0     0    vo       hf
1    2    vr     0     0       hx
2    2     0     0     0       fe
3    3     0    bt     0       rn

In [11]: filtered = df.filter(regex='T_.+')

In [12]: res = filtered.replace({'\D': 1})

In [13]: res
Out[13]:
  T_opp T_Dir T_Enh
0     0     0     1
1     1     0     0
2     0     0     0
3     0     1     0

In [14]: not_filtered = df[df.columns - filtered.columns]

In [15]: not_filtered
Out[15]:
  Activity  fol
0       hf    1
1       hx    2
2       fe    2
3       rn    3

In [16]: res.join(not_filtered)
Out[16]:
  T_opp T_Dir T_Enh Activity  fol
0     0     0     1       hf    1
1     1     0     0       hx    2
2     0     0     0       fe    2
3     0     1     0       rn    3

Note that the original order of the columns is not retained.

You can use regular expressions to search for column names, which might be more useful than explicitly constructing a list if you have many columns to keep. The - operator performs set difference when used with two Index objects (df.columns is an Index).

You'll probably need to call DataFrame.convert_objects() afterward unless your columns are mixed string/integer columns. My solution assumes they are all strings so I call convert_objects() to coerce the values to int dtype.
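
On current pandas versions both the - operator for Index set difference and convert_objects() are gone (Index.difference() and pd.to_numeric are the usual replacements), so a rough modern sketch of this filter-and-join route might look like this; note it swaps the regex replace for a numeric coercion, which lands directly on int dtype:

    import pandas as pd

    # Keep only the T_ columns, turn strings into 1 and keep the zeros.
    filtered = df.filter(regex='^T_')
    res = (filtered
           .apply(pd.to_numeric, errors='coerce')  # strings -> NaN
           .fillna(1)
           .astype(int))

    # Index.difference replaces the old df.columns - filtered.columns idiom.
    not_filtered = df[df.columns.difference(filtered.columns)]

    res.join(not_filtered)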