I'm working with pandas. My goal is to convert several columns in a DataFrame, each containing either NaN or string data, into what is more or less a dummy variable (0 for NaN; 1 for any string). I'd like to do this without listing every string and replacing each one, because the data contain typos and that approach would lead to errors. I've been able to replace all the NaN values with 0 using the fillna function, which works like a dream!
I am hoping for something similar that will replace all string data with 1 but leave the 0s in place. I've searched Stack Overflow and elsewhere, to little avail.
The data look roughly like this, where I only want this to apply to columns starting with T_:
fol  T_opp  T_Dir  T_Enh  Activity
1    0      0      vo     hf
2    vr     0      0      hx
2    0      0      0      fe
3    0      bt     0      rn
I'd like the output to look the same, but with "vr", "bt", and "vo" each replaced with the integer 1. From what I can tell, the pd.get_dummies function is not what I'm looking for. I also can't make this work with replace(). I tried something using a True/False mask and a list of zeros, but the outcome was so wrong I won't bother to post the code here.
Edited: I've added an additional column in the toy data above. The 'Activity' column is some data, also strings, that I do not want to touch.
You can do this with DataFrame.replace() with a regular expression:
If for some reason you're against dicts, you can be very explicit about it too:
But wait, there's more! You can specify the columns you want to operate on by passing a nested dict (keys cannot be regular expressions; well, they can, but it won't do anything except return the frame):
EDIT: Since you want to replace all strings with the number 1 (as per your comments below), do:
EDIT: Microbenchmarks might be useful here:
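A minimal sketch of such a comparison, assuming a synthetic frame shaped like the toy data in the question. The function names via_to_numeric and via_replace are hypothetical, pd.to_numeric stands in for convert_objects() (which newer pandas releases no longer provide), and the pattern ".+" is an assumed way to match any string. The script only prints timings; which approach wins depends on the data and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical synthetic data: T_ columns hold the integer 0 or a short string.
n = 5_000
df = pd.DataFrame({
    "fol": [i % 3 + 1 for i in range(n)],
    "T_opp": ["vr" if i % 3 == 0 else 0 for i in range(n)],
    "T_Dir": ["bt" if i % 5 == 0 else 0 for i in range(n)],
    "T_Enh": ["vo" if i % 7 == 0 else 0 for i in range(n)],
})
t_cols = [c for c in df.columns if c.startswith("T_")]


def via_to_numeric(frame):
    # Strings cannot be parsed as numbers, so they become NaN and are filled with 1.
    out = frame[t_cols].apply(pd.to_numeric, errors="coerce")
    return out.fillna(1).astype(int)


def via_replace(frame):
    # Every string cell matches the regex and is replaced with 1; integers are skipped.
    out = frame[t_cols].replace(to_replace=r".+", value=1, regex=True)
    return out.astype(int)


for fn in (via_to_numeric, via_replace):
    seconds = timeit.timeit(lambda: fn(df), number=5)
    print(f"{fn.__name__}: {seconds / 5 * 1e3:.2f} ms per call")
```

Both functions should return the same 0/1 values, so the comparison measures speed only.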
Andy's method (faster):
DataFrame.replace():
If you have columns containing strings that you want to keep
Yet another way is to use filter and join the results together after replacement:
Note that the original order of the columns is not retained.
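The filter-and-join approach might look roughly like this. It is a sketch built on the toy data from the question, not the original code: the pattern ".+" is an assumed way to match any string, and df.columns.difference() is used to pick out the columns to keep.

```python
import pandas as pd

# Toy data from the question, after NaN has already been filled with 0.
df = pd.DataFrame({
    "fol": [1, 2, 2, 3],
    "T_opp": [0, "vr", 0, 0],
    "T_Dir": [0, 0, 0, "bt"],
    "T_Enh": ["vo", 0, 0, 0],
    "Activity": ["hf", "hx", "fe", "rn"],
})

# Select only the columns whose names start with "T_".
t_part = df.filter(regex=r"^T_")

# Replace every string cell with 1; the integer 0s are not strings and stay as they are.
t_part = t_part.replace(to_replace=r".+", value=1, regex=True).astype(int)

# Columns to leave untouched, found by set difference on the column Index.
keep = df[df.columns.difference(t_part.columns)]

# Join the two pieces back together on the row index.
result = keep.join(t_part)
print(result)
```

Because difference() returns its result sorted, the untouched columns come first in the joined frame, which is why the original column order is lost.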
You can use regular expressions to search for column names, which might be more useful than explicitly constructing a list if you have many columns to keep. The - operator performs set difference when used with two Index objects (df.columns is an Index).
You'll probably need to call DataFrame.convert_objects() afterward unless your columns are mixed string/integer columns. My solution assumes they are all strings, so I call convert_objects() to coerce the values to int dtype.
Another option is to do this the other way around, first convert to numeric:
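As a sketch of that first step on the toy data from the question: pd.to_numeric with errors="coerce" is used here as a stand-in, on the assumption that convert_objects() is unavailable in the installed pandas. Restricting the conversion to the T_ columns is also an assumption, made so that the Activity column is left alone.

```python
import pandas as pd

# Toy data from the question, after NaN has already been filled with 0.
df = pd.DataFrame({
    "fol": [1, 2, 2, 3],
    "T_opp": [0, "vr", 0, 0],
    "T_Dir": [0, 0, 0, "bt"],
    "T_Enh": ["vo", 0, 0, 0],
    "Activity": ["hf", "hx", "fe", "rn"],
})
t_cols = [c for c in df.columns if c.startswith("T_")]

# Coerce the T_ columns to numbers; anything unparseable (the strings) becomes NaN.
numeric = df.copy()
numeric[t_cols] = numeric[t_cols].apply(pd.to_numeric, errors="coerce")
print(numeric)
```

After this step the former strings are NaN and the zeros are still zero, so the roles of NaN and 0 are swapped relative to the original data.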
And then fill in the NaNs with 1:
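A sketch of the fill step, starting from a frame that is assumed to be in the state left by the numeric conversion (former strings are NaN). The final astype(int) is an addition here so the result holds integers rather than floats.

```python
import numpy as np
import pandas as pd

# Assumed state after the numeric conversion: strings in T_ columns are now NaN.
numeric = pd.DataFrame({
    "fol": [1, 2, 2, 3],
    "T_opp": [0.0, np.nan, 0.0, 0.0],
    "T_Dir": [0.0, 0.0, 0.0, np.nan],
    "T_Enh": [np.nan, 0.0, 0.0, 0.0],
    "Activity": ["hf", "hx", "fe", "rn"],
})
t_cols = [c for c in numeric.columns if c.startswith("T_")]

# Fill the NaNs (former strings) with 1 and convert the T_ columns to integers.
numeric[t_cols] = numeric[t_cols].fillna(1).astype(int)
print(numeric)
```

Filling only the T_ columns keeps any NaN values in other columns untouched.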