I am trying to load training and test data from CSV files, fit scikit-learn's random forest regressor, and then predict the output for the test file.
The TrainLoanData.csv file contains 5 columns: the first column is the output and the next 4 columns are the features. TestLoanData.csv contains 4 columns, the features only.
When I run the code, I get this error:
predicted_probs = ["%f" % x[1] for x in predicted_probs]
IndexError: invalid index to scalar variable.
What does this mean?
Here is my code:
import numpy, scipy, sklearn, csv_io  # csv_io from https://raw.github.com/benhamner/BioResponse/master/Benchmarks/csv_io.py
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor

def main():
    # read in the training file
    train = csv_io.read_data("TrainLoanData.csv")
    # set the training responses
    target = [x[0] for x in train]
    # set the training features
    train = [x[1:] for x in train]
    # read in the test file
    realtest = csv_io.read_data("TestLoanData.csv")
    # random forest code
    rf = RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)
    # fit the training data
    print('fitting the model')
    rf.fit(train, target)
    # run model against test data
    predicted_probs = rf.predict(realtest)
    print(predicted_probs)
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("random_forest_solution.csv", predicted_probs)

main()
The return value from RandomForestRegressor.predict() is a 1-D array of floats, so you are trying to index a float like (-0.6)[1], which is not possible. As a side note, a regressor does not return probabilities.
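A minimal sketch of the fix, with toy arrays standing in for the CSV files: since predict() yields one scalar per test row, format each value directly instead of indexing into it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for TrainLoanData.csv / TestLoanData.csv
# (4 features per row, 1 target value).
X_train = np.array([[1.0, 2.0, 3.0, 4.0],
                    [2.0, 3.0, 4.0, 5.0],
                    [3.0, 4.0, 5.0, 6.0],
                    [4.0, 5.0, 6.0, 7.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0])
X_test = np.array([[1.5, 2.5, 3.5, 4.5]])

rf = RandomForestRegressor(n_estimators=10, min_samples_split=2)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)        # 1-D array of floats, shape (n_samples,)
rows = ["%f" % x for x in preds]  # format each scalar directly, no x[1]
print(rows)
```

If you actually want class probabilities, you need a classifier (e.g. RandomForestClassifier) and its predict_proba() method, which returns one row of probabilities per sample.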
First, it's always helpful to also have the sample data so we can reproduce and debug your problem. If it is too big or confidential, you could extract just the interesting part of it.

The contents of the variable predicted_probs are not what you expect: it is a flat list (or array) of plain numbers, one per test row, which is also what I'd expect. In sklearn, the fit() method takes the training data together with the corresponding targets (for a classifier, usually integers or strings). The predict() method then takes only the validation data and returns one prediction per row, again a single number or label, not a nested structure. If you want to know how accurate the trained model is, you must not just train and predict; you should do cross-validation, i.e., repeatedly train and validate, each time checking how good the predictions were. sklearn has excellent documentation; I'm sure you will find the respective section. If not, ask me.
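A short sketch of that cross-validation step, using sklearn's built-in cross_val_score on synthetic data (in current scikit-learn it lives in sklearn.model_selection; for a regressor the default score per fold is R²):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 100 samples, 4 features,
# target is a noisy linear combination of the features.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(scale=0.1, size=100)

rf = RandomForestRegressor(n_estimators=10, random_state=0)

# 5-fold cross-validation: train on 4/5 of the data, score on the
# held-out 1/5, repeated so every sample is validated once.
scores = cross_val_score(rf, X, y, cv=5)
print(scores)
print("mean R^2:", scores.mean())
```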
Try using numpy's "genfromtxt" instead of "csv_io.read_data" for dataset loading - it will automatically transform your CSV data into a numpy array. Reading the Getting Started With Python For Data Science article may also be useful for you.
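A quick sketch of that loading step, using an in-memory string in place of TrainLoanData.csv (the values here are made up for illustration):

```python
import io
import numpy as np

# Simulated CSV contents; in the question this would be
# np.genfromtxt("TrainLoanData.csv", delimiter=",") instead.
csv_text = io.StringIO("10.0,1.0,2.0,3.0,4.0\n"
                       "20.0,2.0,3.0,4.0,5.0\n")
data = np.genfromtxt(csv_text, delimiter=",")

target = data[:, 0]   # first column: the output
train = data[:, 1:]   # remaining four columns: the features
print(target.shape, train.shape)
```

Slicing the resulting 2-D array replaces both list comprehensions from the original code in one step.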