RandomForestClassfier.fit(): ValueError: could not

Given is a simple CSV file:

A,B,C
Hello,Hi,0
Hola,Bueno,1

Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)

train_y = test['C'] == 1
train_x = test[cols]

clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)

But I just get this traceback when invoking fit():

ValueError: could not convert string to float: 'Bueno'

scikit-learn version is 0.16.1.

标签： python scikit-learn random-forest

5条回答

\"骚年 ilove

2楼-- · 2019-01-22 02:26

You may not pass str to fit this kind of classifier.

For example, if you have a feature column named 'grade' which has 3 different grades:

A,B and C.

you have to transfer those str "A","B","C" to matrix by encoder like the following:

A = [1,0,0]

B = [0,1,0]

C = [0,0,1]

because the str does not have numerical meaning for the classifier.

In scikit-learn, OneHotEncoder and LabelEncoder are available in inpreprocessing module. However OneHotEncoder does not support to fit_transform() of string. "ValueError: could not convert string to float" may happen during transform.

You may use LabelEncoder to transfer from str to continuous numerical values. Then you are able to transfer by OneHotEncoder as you wish.

In the Pandas dataframe, I have to encode all the data which are categorized to dtype:object. The following code works for me and I hope this will help you.

 from sklearn import preprocessing
    le = preprocessing.LabelEncoder()
    for column_name in train_data.columns:
        if train_data[column_name].dtype == object:
            train_data[column_name] = le.fit_transform(train_data[column_name])
        else:
            pass

0人赞添加讨论(0) 举报

贼婆χ

3楼-- · 2019-01-22 02:33

I had a similar issue and found that pandas.get_dummies() solved the problem. Specifically, it splits out columns of categorical data into sets of boolean columns, one new column for each unique value in each input column. In your case, you would replace train_x = test[cols] with:

train_x = pandas.get_dummies(test[cols])

This transforms the train_x Dataframe into the following form, which RandomForestClassifier can accept:

   C  A_Hello  A_Hola  B_Bueno  B_Hi
0  0        1       0        0     1
1  1        0       1        1     0

0人赞添加讨论(0) 举报

叼着烟拽天下

4楼-- · 2019-01-22 02:37

LabelEncoding worked for me (basically you've to encode your data feature-wise) (mydata is a 2d array of string datatype):

myData=np.genfromtxt(filecsv, delimiter=",", dtype ="|a20" ,skip_header=1);

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for i in range(*NUMBER OF FEATURES*):
    myData[:,i] = le.fit_transform(myData[:,i])

0人赞添加讨论(0) 举报

Luminary・发光体

5楼-- · 2019-01-22 02:44

You have to do some encoding before using fit. As it was told fit() does not accept Strings but you solve this.

There are several classes that can be used :

LabelEncoder : turn your string into incremental value
OneHotEncoder : use One-of-K algorithm to transform your String into integer

Personally I have post almost the same question on StackOverflow some time ago. I wanted to have a scalable solution but didn't get any answer. I selected OneHotEncoder that binarize all the strings. It is quite effective but if you have a lot different strings the matrix will grow very quickly and memory will be required.

0人赞添加讨论(0) 举报

ら.Afraid

6楼-- · 2019-01-22 02:44

You can't pass str to your model fit() method. as it mentioned here

The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.

Try transforming your data to float and give a try to LabelEncoder.

0人赞添加讨论(0) 举报

RandomForestClassfier.fit(): ValueError: could not

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间