Given is a simple CSV file:
A,B,C
Hello,Hi,0
Hola,Bueno,1
Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:
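import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)
train_y = test['C'] == 1
train_x = test[cols]  # columns A and B still hold raw strings here
clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)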
But I just get this traceback when invoking fit():
ValueError: could not convert string to float: 'Bueno'
scikit-learn version is 0.16.1.
You cannot pass str values to fit() for this kind of classifier. For example, if you have a feature column named 'grade' with three different grades, A, B and C, you have to convert the str values "A", "B" and "C" into a numeric matrix with an encoder, because a str has no numerical meaning for the classifier.
In scikit-learn, OneHotEncoder and LabelEncoder are available in the preprocessing module. However, in older releases such as the 0.16.1 used in the question, OneHotEncoder does not support fit_transform() on strings, so "ValueError: could not convert string to float" may occur during the transform.
You can use LabelEncoder to convert the str values to integer codes, and then pass those codes through OneHotEncoder if you wish.
In my pandas DataFrame, I had to encode every column of dtype: object. Encoding those columns one by one worked for me, and I hope this helps you.
I had a similar issue and found that pandas.get_dummies() solved the problem. Specifically, it splits columns of categorical data into sets of boolean columns, one new column for each unique value in each input column. In your case, you would replace
train_x = test[cols]
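with a version that passes the same columns through pandas.get_dummies(). This transforms the train_x DataFrame into an all-numeric form, with one indicator column per distinct string value, which RandomForestClassifier can accept.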
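A minimal sketch of that replacement, under a few assumptions: the CSV from the question is inlined through io.StringIO so the snippet runs on its own, the columns= and dtype=int arguments are passed explicitly so the result does not depend on the pandas version's defaults, and random_state is fixed only for repeatability. None of these choices come from the original answer.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The CSV from the question, inlined so the sketch is self-contained.
csv_text = "A,B,C\nHello,Hi,0\nHola,Bueno,1\n"
test = pd.read_csv(io.StringIO(csv_text), dtype={"A": str, "B": str, "C": int})

cols = ["A", "B", "C"]
train_y = test["C"] == 1

# One 0/1 indicator column per distinct string value in A and B.
# The numeric column C is passed through unchanged.
train_x = pd.get_dummies(test[cols], columns=["A", "B"], dtype=int)
print(train_x)

# Every column is numeric now, so fit() no longer raises the ValueError.
clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x, train_y)
```

With this input the encoded frame should have the columns C, A_Hello, A_Hola, B_Bueno and B_Hi. Note that get_dummies() only knows the values present in the frame it is given, so training and prediction data need to end up with the same columns.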
LabelEncoder worked for me. Basically, you have to encode your data feature by feature (mydata is a 2D array of strings).
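The feature-wise encoding described here might look roughly like the sketch below. The contents of mydata are a hypothetical stand-in built from the question's two rows, and keeping the fitted encoders in a list is an illustrative choice rather than part of the original answer.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for `mydata`: a 2-D array of strings, one column per feature.
mydata = np.array([["Hello", "Hi"], ["Hola", "Bueno"]])

encoded = np.empty(mydata.shape, dtype=int)
encoders = []  # one fitted encoder per column, so codes can be mapped back later

for j in range(mydata.shape[1]):
    le = LabelEncoder()
    # Each column gets its own code space: 0..n_classes-1 in sorted label order.
    encoded[:, j] = le.fit_transform(mydata[:, j])
    encoders.append(le)

print(encoded)
print(encoders[1].inverse_transform([0, 1]))
```

The integer codes imply an ordering between categories that the data may not really have. Tree-based models often tolerate this, but it is worth keeping in mind. scikit-learn documents LabelEncoder as a tool for target labels; from version 0.20 on, OrdinalEncoder does the same job for a whole feature matrix in one call.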
You have to do some encoding before calling fit(). As already noted, fit() does not accept strings, but this can be solved.
Several classes can be used for this.
Personally, I posted almost the same question on StackOverflow some time ago. I wanted a scalable solution but didn't get any answer. I chose OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have a lot of different strings the matrix grows very quickly and needs a lot of memory.
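To illustrate the memory point, here is a small sketch of OneHotEncoder applied directly to string features. It assumes a scikit-learn release of 0.20 or later, where the encoder accepts strings, which is newer than the 0.16.1 in the question. The sample array is made up from the question's two rows.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two string features, taken from the question's two rows.
X = np.array([["Hello", "Hi"], ["Hola", "Bueno"]], dtype=object)

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")

# By default the result is a SciPy sparse matrix: only the non-zero entries
# are stored, which keeps memory use down when there are many distinct strings.
X_sparse = enc.fit_transform(X)

print(enc.categories_)     # the distinct values found in each input column
print(X_sparse.shape)      # one output column per distinct value
print(X_sparse.toarray())  # dense view, only sensible for small examples
```

The number of output columns still grows with the number of distinct strings, but each row stores just one non-zero entry per original feature, and RandomForestClassifier.fit() accepts sparse input.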
You can't pass str to your model's fit() method, as mentioned here. Try transforming your data to float, and give LabelEncoder a try.
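Put together for a DataFrame, the suggestion could be sketched as follows. This is only one possible reading of it: the frame is rebuilt from the question's two rows, non-numeric columns are detected with pd.api.types.is_numeric_dtype, and column C is left out of the features because it is the label. The question itself keeps C in cols, so that last choice is an assumption of this sketch.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Same two rows as the CSV in the question.
df = pd.DataFrame({"A": ["Hello", "Hola"], "B": ["Hi", "Bueno"], "C": [0, 1]})

features = df[["A", "B"]].copy()  # assumption: C is the label, so it is not a feature
target = df["C"] == 1

# Replace every non-numeric column with integer codes, one encoder per column.
for name in features.columns:
    if not pd.api.types.is_numeric_dtype(features[name]):
        features[name] = LabelEncoder().fit_transform(features[name])

features = features.astype(float)  # every column is numeric at this point

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, target)
```

In real use the fitted encoders would need to be kept and reused on new data, so that the same string always maps to the same code.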