Recursive feature elimination on Random Forest usi

I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process.

However, when I try to use the RFECV method, I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'

Random Forests don't have coefficients per se, but they do have rankings by Gini score. So, I'm wondering how to get around this problem.

Please note that I want to use a method that will explicitly tell me what features from my pandas DataFrame were selected in the optimal grouping as I am using recursive feature selection to try to minimize the amount of data I will input into the final classifier.

Here's some example code:

from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pd.Series(iris.target, name='target')
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=10, scoring='ROC', verbose=2)
selector=rfecv.fit(x, y)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 336, in fit
    ranking_ = rfe.fit(X_train, y_train).ranking_
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 148, in fit
    if estimator.coef_.ndim > 1:
AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'

Tags: python pandas scikit-learn random-forest feature-selection

4 Answers

Melony?

#2 · 2019-02-02 11:41

Here's what I ginned up. It's a pretty simple solution that relies on a custom accuracy metric (called weightedAccuracy), since I'm classifying a highly unbalanced dataset, but it should be easy to extend if desired.

from sklearn import datasets
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix


def get_enhanced_confusion_matrix(actuals, predictions, labels):
    """Enhance confusion_matrix by adding sensitivity and specificity metrics."""
    cm = confusion_matrix(actuals, predictions, labels = labels)
    sensitivity = float(cm[1][1]) / float(cm[1][0]+cm[1][1])
    specificity = float(cm[0][0]) / float(cm[0][0]+cm[0][1])
    weightedAccuracy = (sensitivity * 0.9) + (specificity * 0.1)
    return cm, sensitivity, specificity, weightedAccuracy

iris = datasets.load_iris()
x=pandas.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pandas.Series(iris.target, name='target')

response, _  = pandas.factorize(y)

xTrain, xTest, yTrain, yTest = train_test_split(x, response, test_size = .25, random_state = 36583)
print("building the first forest")
rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, n_jobs = -1, verbose = 1)
rf.fit(xTrain, yTrain)
# rank the features once, using the forest trained on all of them
importances = pandas.DataFrame({'name':x.columns,'imp':rf.feature_importances_
                                }).sort_values('imp', ascending = False).reset_index(drop = True)

# labels [0,1] restrict the confusion matrix to classes 0 and 1
cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
numFeatures = len(x.columns)

rfeMatrix = pandas.DataFrame({'numFeatures':[numFeatures], 
                              'weightedAccuracy':[weightedAccuracy], 
                              'sensitivity':[sensitivity], 
                              'specificity':[specificity]})

print("running RFE on  %d features"%numFeatures)

# refit on the top i features for i = 1 .. numFeatures - 1
for i in range(1,numFeatures,1):
    varsUsed = importances['name'][0:i].tolist()
    print("now using %d of %s features"%(len(varsUsed), numFeatures))
    xTrain, xTest, yTrain, yTest = train_test_split(x[varsUsed], response, test_size = .25)
    rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2,
                                n_jobs = -1, verbose = 1)
    rf.fit(xTrain, yTrain)
    cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
    print("\n"+str(cm))
    print('the sensitivity is %d percent'%(sensitivity * 100))
    print('the specificity is %d percent'%(specificity * 100))
    print('the weighted accuracy is %d percent'%(weightedAccuracy * 100))
    # DataFrame.append is gone from recent pandas, so concatenate instead
    rfeMatrix = pandas.concat(
                                [rfeMatrix,
                                pandas.DataFrame({'numFeatures':[len(varsUsed)], 
                                'weightedAccuracy':[weightedAccuracy], 
                                'sensitivity':[sensitivity], 
                                'specificity':[specificity]})], ignore_index = True)
print("\n"+str(rfeMatrix))
# pick the smallest feature count that reaches the best weighted accuracy
maxAccuracy = rfeMatrix.weightedAccuracy.max()
maxAccuracyFeatures = min(rfeMatrix.numFeatures[rfeMatrix.weightedAccuracy == maxAccuracy])
featuresUsed = importances['name'][0:maxAccuracyFeatures].tolist()

print("the final features used are %s"%featuresUsed)
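Since the question asks for OOB ROC scoring, here is a rough sketch of how the loop above could be adapted, not a drop-in replacement. It is a variation rather than the code above: it refits the forest and drops the single least important feature at every step, and it scores each subset with the ROC AUC of the out-of-bag probabilities (oob_score=True, oob_decision_function_) instead of weightedAccuracy. The binary target (class 2 vs. the rest), the helper name oob_auc and the 200-tree forest are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
# assumption: reduce iris to a two-class problem so that ROC AUC is well defined
y = (pd.Series(iris.target, name='target') == 2).astype(int)


def oob_auc(features):
    """Fit a forest on the given columns; return (OOB ROC AUC, importances)."""
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(x[features], y)
    # column 1 holds the out-of-bag probability estimate for the positive class
    auc = roc_auc_score(y, rf.oob_decision_function_[:, 1])
    return auc, rf.feature_importances_


features = list(x.columns)
history = []  # (feature subset, OOB AUC) for every step
while True:
    auc, importances = oob_auc(features)
    history.append((list(features), auc))
    if len(features) == 1:
        break
    # drop the feature the current forest considers least important
    features.pop(int(np.argmin(importances)))

best_features, best_auc = max(history, key=lambda item: item[1])
print("best subset: %s (OOB AUC %.3f)" % (best_features, best_auc))
```

No hold-out split is needed here because the OOB samples play that role, which is handy when data is scarce.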


女痞

#3 · 2019-02-02 11:45

Here's what I've done to adapt RandomForestClassifier to work with RFECV:

class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        # expose the Gini importances under the attribute name RFECV looks for
        self.coef_ = self.feature_importances_
        return self

Just using this class does the trick if you score with 'accuracy' or 'f1'. For 'roc_auc', RFECV complains that the multiclass format is not supported. After changing the problem to a two-class classification with the code below, the 'roc_auc' scoring works. (Using Python 3.4.1 and scikit-learn 0.15.1)

y=(pd.Series(iris.target, name='target')==2).astype(int)

Plugging into your code:

from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        # expose the Gini importances under the attribute name RFECV looks for
        self.coef_ = self.feature_importances_
        return self

iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=(pd.Series(iris.target, name='target')==2).astype(int)
rf = RandomForestClassifierWithCoef(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc', verbose=2)
selector=rfecv.fit(x, y)
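To get the selected features back as DataFrame column names, the fitted selector's support_ mask and ranking_ array can be mapped onto x.columns. A minimal sketch along those lines; the smaller forest (100 trees) and the fixed random_state are assumptions made only to keep it quick and repeatable:

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV


class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_
        return self


iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
y = (pd.Series(iris.target, name='target') == 2).astype(int)

rf = RandomForestClassifierWithCoef(n_estimators=100, min_samples_leaf=5, random_state=0)
selector = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc').fit(x, y)

# support_ is a boolean mask over the input columns
selected = x.columns[selector.support_].tolist()
# ranking_ is 1 for every selected feature; larger values were eliminated earlier
ranking = pd.Series(selector.ranking_, index=x.columns).sort_values()

print("selected features: %s" % selected)
print(ranking)
```

selector.n_features_ gives the size of the chosen subset if you only need the count.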


姐就是有狂的资本

#4 · 2019-02-02 11:48

This is my code; I've tidied it up a bit to make it relevant to your task:

import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, log_loss

# fea_cols, train, test, status_labels, priors, private_priors and the kf helper
# module are defined elsewhere in the original (much bigger) code
features_to_use = fea_cols #  this is a list of features
# empty dataframe
trim_5_df = DataFrame(columns=features_to_use)
run=1
# this will remove the 5 worst features determined by their feature importance computed by the RF classifier
while len(features_to_use)>6:
    print('number of features:%d' % (len(features_to_use)))
    # build the classifier
    clf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
    # train the classifier
    clf.fit(train[features_to_use], train['OpenStatusMod'].values)
    # score on the training data (features and labels must come from the same frame)
    print('classifier score: %f\n' % clf.score(train[features_to_use], train['OpenStatusMod'].values))
    # predict the class and print the classification report, f1 micro, f1 macro score
    pred = clf.predict(test[features_to_use])
    print(classification_report(test['OpenStatusMod'].values, pred, target_names=status_labels))
    print('micro score: ')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='micro'))
    print('macro score:\n')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='macro'))
    # predict the class probabilities
    probs = clf.predict_proba(test[features_to_use])
    # rescale the priors
    new_probs = kf.cap_and_update_priors(priors, probs, private_priors, 0.001)
    # calculate logloss with the rescaled probabilities
    print('log loss: %f\n' % log_loss(test['OpenStatusMod'].values, new_probs))
    row={}
    if hasattr(clf, "feature_importances_"):
        # sort the features by importance
        sorted_idx = np.argsort(clf.feature_importances_)
        # reverse the order so it is descending
        sorted_idx = sorted_idx[::-1]
        # add to dataframe
        row['num_features'] = len(features_to_use)
        row['features_used'] = ','.join(features_to_use)
        # trim the worst 5
        sorted_idx = sorted_idx[: -5]
        # swap the features list with the trimmed features
        temp = features_to_use
        features_to_use=[]
        for feat in sorted_idx:
            features_to_use.append(temp[feat])
        # add the logloss performance
        row['logloss']=[log_loss(test['OpenStatusMod'].values, new_probs)]
    print('')
    # add the row to the dataframe (DataFrame.append is gone from recent pandas)
    trim_5_df = pd.concat([trim_5_df, DataFrame(row)])
    run +=1

What I'm doing here: I have a list of features I want to train on and then predict against. Using the feature importances, I trim the worst 5 and repeat. During each run I add a row recording the prediction performance so that I can do some analysis later.

The original code was much bigger; I had different classifiers and datasets I was analysing, but I hope you get the picture from the above. One thing I noticed was that, for random forest, the number of features I removed on each run affected the performance, so trimming 1, 3 or 5 features at a time resulted in a different set of best features.

I found that using a GradientBoostingClassifier was more predictable and repeatable, in the sense that the final set of best features agreed whether I trimmed 1 feature at a time or 3 or 5.
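One way to check this kind of step-size sensitivity on your own data is to run scikit-learn's RFE with several step values and compare the surviving feature sets. This is only a sketch under assumptions: the make_classification dataset, the estimator sizes and the target of 5 kept features are made up for illustration, and passing tree ensembles to RFE needs a scikit-learn version that reads feature_importances_ (0.17 or later). Whether the sets agree will depend on the data, so no particular outcome is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE

# assumption: a small synthetic dataset stands in for the real one
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           n_redundant=2, random_state=0)

estimators = {
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'gradient boosting': GradientBoostingClassifier(n_estimators=50, random_state=0),
}


def selected_features(estimator, step, n_keep=5):
    """Indices of the features RFE keeps when removing `step` features per round."""
    rfe = RFE(estimator, n_features_to_select=n_keep, step=step).fit(X, y)
    return set(rfe.get_support(indices=True).tolist())


results = {}
for name, estimator in estimators.items():
    results[name] = {step: selected_features(estimator, step) for step in (1, 3, 5)}
    for step, feats in results[name].items():
        print("%s, step=%d: %s" % (name, step, sorted(feats)))
    # True if every step size ended up with the same feature set
    agree = len(set(map(frozenset, results[name].values()))) == 1
    print("%s: step sizes agree: %s" % (name, agree))
```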

I hope I'm not teaching you to suck eggs here; you probably know more than me. My approach to ablative analysis was to use a fast classifier to get a rough idea of the best sets of features, then switch to a better-performing classifier, then start hyperparameter tuning, again doing coarse-grained comparisons first and then fine-grained ones once I got a feel for what the best params were.


狗以群分

#5 · 2019-02-02 11:56

I submitted a request to add coef_ so RandomForestClassifier may be used with RFECV. However, the change had already been made. This change will be in version 0.17.

https://github.com/scikit-learn/scikit-learn/issues/4945

You can pull the latest dev build if you want to use it now.
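With a version that includes the change, something like the following should work without any subclassing. This is a sketch assuming scikit-learn 0.17 or later; the binary target is kept because 'roc_auc' scoring is reported above not to support the multiclass format, and the forest sizes are arbitrary. It also shows one way to cut the DataFrame down to the selected columns before training the final classifier.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
y = (pd.Series(iris.target, name='target') == 2).astype(int)

# a plain RandomForestClassifier: RFECV ranks by feature_importances_
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)
selector = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc').fit(x, y)

# keep only the selected columns, with their names intact
x_reduced = x.loc[:, selector.support_]
print("kept columns: %s" % list(x_reduced.columns))

# train the final classifier on the reduced data
final_rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_reduced, y)
```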
