Training a sklearn LogisticRegression classifier w

I am trying to use scikit-learn 0.12.1 to:

train a LogisticRegression classifier
evaluate the classifier on held-out validation data
feed new data to this classifier and retrieve the 5 most probable labels for each observation

Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.

This results in 2 problems:

1. The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels, but it exacerbates problem 2.
2. The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
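
The mismatch in problem 2 can be sketched with plain NumPy arrays. This is only an illustration under assumed toy values: `trained_classes` stands in for clf.classes_, `prob` stands in for the output of predict_proba on one observation, and the label names are made up.

```python
import numpy as np

# Hypothetical stand-ins: `all_classes` is the full label vocabulary,
# `trained_classes` plays the role of clf.classes_, and `prob` plays the
# role of clf.predict_proba(...) for a single observation.
all_classes = np.array(["bird", "cat", "dog", "fish", "horse"])
trained_classes = np.array(["cat", "dog", "fish"])
prob = np.array([[0.2, 0.7, 0.1]])

# Column indices ordered from most to least probable.
order = np.argsort(-prob, axis=1)

# Misleading: the indices refer to columns of `prob`, not to the vocabulary.
wrong_labels = all_classes[order[0]]
# Consistent: the columns of `prob` follow the order of the trained classes.
right_labels = trained_classes[order[0]]
```

Indexing the full vocabulary with these column indices silently returns labels such as "bird" that the classifier never saw, which is why the columns have to be realigned before using argsort.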

My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly useable in my situation.

Tags: python machine-learning scikit-learn

3 Answers

Animai°情兽

Reply #2 · 2019-05-05 01:46

Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,

from itertools import repeat

# Determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(all_classes).difference(clf.classes_)

# The order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    # Pair each trained class with its probability in this row, then
    # append the unseen classes with probability 0. list() is needed on
    # Python 3, where zip returns an iterator.
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))

produces a list of (cls, prob) pairs.


在下西门庆

Reply #3 · 2019-05-05 01:46

Building on larsman's excellent answer, I ended up with this:

from itertools import repeat
import numpy as np

# Determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(all_classes).difference(clf.classes_)

# The order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
    # list() is needed on Python 3, where zip returns an iterator.
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
    # Put the probabilities in class order.
    prob_per_class = sorted(prob_per_class)
    # Keep only the probabilities, as a list rather than a generator.
    new_prob.append([p for _, p in prob_per_class])
new_prob = np.asarray(new_prob)

new_prob is an [n_samples, n_classes] array just like the output from predict_proba, except now it includes 0 probabilities for the previously unseen classes.


爷、活的狠高调

Reply #4 · 2019-05-05 01:53

If what you want is an array like that returned by predict_proba, but with columns corresponding to sorted all_classes, how about:

import numpy

all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
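
Once the columns line up with the sorted all_classes, retrieving the 5 most probable labels per observation is a matter of argsort. The following is a rough sketch only: `top_k_labels` is a hypothetical helper, and the toy arrays stand in for clf.classes_ and the output of predict_proba.

```python
import numpy as np

def top_k_labels(new_prob, all_classes, k=5):
    """Return the k most probable labels per row, most probable first."""
    all_classes = np.asarray(all_classes)
    k = min(k, all_classes.size)
    # argsort sorts ascending, so reverse the columns and keep the first k.
    top_idx = np.argsort(new_prob, axis=1)[:, ::-1][:, :k]
    return all_classes[top_idx]

# Toy stand-ins for the real objects (assumed values, not real output).
all_classes = np.array(sorted(["bird", "cat", "dog", "fish", "horse"]))
trained = np.array(["cat", "dog", "fish"])            # role of clf.classes_
prob = np.array([[0.2, 0.7, 0.1], [0.5, 0.1, 0.4]])   # role of predict_proba

# Same expansion as above: zero columns for the unseen classes.
new_prob = np.zeros((prob.shape[0], all_classes.size))
new_prob[:, all_classes.searchsorted(trained)] = prob

top2 = top_k_labels(new_prob, all_classes, k=2)
```

Classes that share the same probability (for example, all the unseen ones at 0) come back in an arbitrary relative order, so only the ranks above the first tie are meaningful.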
