I am doing multilabel classification, where I try to predict the correct labels for each document. Here is my code:
mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)
classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
predicted = cross_val_predict(classifier, X, y)
When running my code I get multiple warnings:
UserWarning: Label not :NUMBER: is present in all training examples.
When I print out the predicted and true labels, about half of all documents have empty label predictions.
Why is this happening? Is it related to the warnings printed while training is running? How can I avoid those empty predictions?
EDIT01:
This also happens with estimators other than LinearSVC(). I've tried RandomForestClassifier() and it gives empty predictions as well. The strange thing is that when I use cross_val_predict(classifier, X, y, method='predict_proba') to predict probabilities for each label, instead of binary 0/1 decisions, there is always at least one label per predicted set with probability > 0 for the given document. So I don't know why this label is not chosen by the binary decision. Or is the binary decision evaluated differently from the probabilities?
EDIT02:
I have found an old post where the OP was dealing with a similar problem. Is this the same case?
Why is this happening? Is it related to the warnings printed while training is running?
The issue is likely to be that some tags occur in just a few documents (check out this thread for details). When you split the dataset into train and test sets to validate your model, some tags may be missing from the training data. Let train_indices be an array with the indices of the training samples. If a particular tag (of index k) does not occur in the training sample, all the elements in the k-th column of the indicator matrix y[train_indices] are zeros.
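As a rough illustration of that check, the sketch below finds the all-zero columns of a training fold. The matrix and the fold indices are made-up toy values, and missing_tags is a name introduced here for illustration only.

```python
import numpy as np

# Hypothetical indicator matrix: 4 documents x 3 tags.
y = np.array([[1, 0, 0],
              [1, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])

# Hypothetical training fold: document 2 is held out for testing.
train_indices = np.array([0, 1, 3])

# A tag is missing from the fold when its column sums to zero.
missing_tags = np.flatnonzero(y[train_indices].sum(axis=0) == 0)
print(missing_tags)
```

Here tag 1 only occurs in the held-out document, so it is the one reported as missing from the training fold.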
How can I avoid those empty predictions?
In the scenario described above, the classifier will not be able to reliably predict the k-th tag in the test documents (more on this in the next paragraph). Therefore you cannot trust the predictions made by clf.predict, and you need to implement the prediction function on your own, for example by using the decision values returned by clf.decision_function, as suggested in this answer.
So I don't know why this label is not chosen by the binary decision. Or is the binary decision evaluated differently from the probabilities?
In datasets containing many labels, most labels tend to occur in only a small fraction of the documents. When such rare labels are fed to a binary classifier (i.e. a classifier that makes a 0-1 prediction), it is highly probable that the classifier will pick 0 for all tags on all documents.
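To make the link with the probabilities concrete: as far as I know, the 0/1 decision for each tag amounts to thresholding that tag's score (0.5 for probabilities, 0 for decision values), so a tag with a small positive probability is still rejected. The sketch below uses made-up probabilities, not output from your model, and the 0.5 cut-off is stated here as an assumption.

```python
import numpy as np

# Hypothetical per-tag probabilities for 3 documents and 4 tags.
proba = np.array([[0.40, 0.10, 0.30, 0.05],
                  [0.70, 0.20, 0.10, 0.60],
                  [0.05, 0.45, 0.02, 0.10]])

# Assumed binary decision rule: keep a tag only if its probability exceeds 0.5.
binary = (proba > 0.5).astype(int)

# Documents for which no tag passes the threshold.
empty_rows = np.flatnonzero(binary.sum(axis=1) == 0)
print(binary)
print(empty_rows)
```

Every row has at least one probability greater than 0, yet the first and last documents end up with no tag at all because none of their probabilities reaches the threshold.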
I have found an old post where OP was dealing with similar problem. Is this the same case?
Yes, absolutely. That poster is facing exactly the same problem as you, and their code is pretty similar to yours.
Demo
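To further explain the issue, I have put together a simple toy example using mock data.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Mock questions (keys) and their tags (values)
Q = {'What does the "yield" keyword do in Python?': ['python'],
     'What is a metaclass in Python?': ['oop'],
     'How do I check whether a file exists using Python?': ['python'],
     'How to make a chain of function decorators?': ['python', 'decorator'],
     'Using i and j as variables in Matlab': ['matlab', 'naming-conventions'],
     'MATLAB: get variable type': ['matlab'],
     'Why is MATLAB so fast in matrix multiplication?': ['performance'],
     'Is MATLAB OOP slow or am I doing something wrong?': ['matlab-oop'],
     }
# list() is needed on Python 3, where keys() and values() return views
dataframe = pd.DataFrame({'body': list(Q.keys()), 'tag': list(Q.values())})
mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)
classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])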
Please notice that I have set min_df=1 since my dataset is much smaller than yours. When I run the following statement:
predicted = cross_val_predict(classifier, X, y)
I get a bunch of warnings:
C:\...\multiclass.py:76: UserWarning: Label not 4 is present in all training examples.
str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 3 is present in all training examples.
str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 5 is present in all training examples.
str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 2 is present in all training examples.
str(classes[c]))
and the following prediction:
In [5]: np.set_printoptions(precision=2, threshold=1000)
In [6]: predicted
Out[6]:
array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
Rows whose entries are all 0 indicate that no tag is predicted for the corresponding document.
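To quantify how many documents are affected, one can count the all-zero rows. The sketch below simply re-enters the prediction matrix shown above as a literal; empty_rows is a name introduced here for illustration.

```python
import numpy as np

# Prediction matrix copied from the output above (8 documents x 7 tags).
predicted = np.array([[0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 1],
                      [0, 0, 0, 0, 0, 0, 1],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 1, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0]])

# Indices of the documents that received no tag at all.
empty_rows = np.flatnonzero(~predicted.any(axis=1))
print(empty_rows, len(empty_rows), "of", len(predicted))
```

In this toy run, five of the eight documents end up without any predicted tag.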
Workaround
For the sake of the analysis, let us validate the model manually rather than through cross_val_predict.
import warnings
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=1, test_size=.5, random_state=0)
# split() returns a generator; take its first (and only) split
train_indices, test_indices = next(rs.split(X))

# Record every warning raised while fitting and predicting
with warnings.catch_warnings(record=True) as received_warnings:
    warnings.simplefilter("always")
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    classifier.fit(X_train, y_train)
    predicted_test = classifier.predict(X_test)

for w in received_warnings:
    print(w.message)
When the snippet above is executed, two warnings are issued (I used a context manager to make sure the warnings are caught):
Label not 2 is present in all training examples.
Label not 4 is present in all training examples.
This is consistent with the fact that the tags of indices 2 and 4 are missing from the training samples:
In [40]: y_train
Out[40]:
array([[0, 0, 0, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 1]])
For some documents, the prediction is empty (those documents corresponding to the rows with all zeros in predicted_test):
In [42]: predicted_test
Out[42]:
array([[0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 1, 0, 0, 0]])
To overcome that issue, you could implement your own prediction function like this:
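def get_best_tags(clf, X, lb, n_tags=3):
    # Per-tag confidence scores, one row per document
    decfun = clf.decision_function(X)
    # Column indices of the n_tags highest scores, best first
    best_tags = np.argsort(decfun)[:, :-(n_tags+1): -1]
    return lb.classes_[best_tags]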
By doing so, each document is always assigned the n_tags tags with the highest confidence scores:
In [59]: mlb.inverse_transform(predicted_test)
Out[59]: [('matlab',), (), (), ('matlab', 'naming-conventions')]
In [60]: get_best_tags(classifier, X_test, mlb)
Out[60]:
array([['matlab', 'oop', 'matlab-oop'],
['oop', 'matlab-oop', 'matlab'],
['oop', 'matlab-oop', 'matlab'],
['matlab', 'naming-conventions', 'oop']], dtype=object)
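If a fixed number of tags per document is too blunt, a possible variant keeps the usual thresholded prediction and only falls back to the single best-scoring tag when a row would otherwise be empty. This is a sketch under assumptions, not part of scikit-learn: predict_with_fallback is a hypothetical helper, the scores are made up, and the threshold of 0 assumes decision values such as those returned by decision_function.

```python
import numpy as np

def predict_with_fallback(decision_values, threshold=0.0):
    """Threshold the scores, then give each empty row its top-scoring tag."""
    decision_values = np.asarray(decision_values, dtype=float)
    predicted = (decision_values > threshold).astype(int)
    # Rows where no tag passed the threshold
    empty = np.flatnonzero(predicted.sum(axis=1) == 0)
    # Best-scoring tag for each of those rows
    best = decision_values[empty].argmax(axis=1)
    predicted[empty, best] = 1
    return predicted

# Hypothetical decision values for 3 documents and 3 tags.
scores = [[0.8, -0.3, -1.0],
          [-0.2, -0.9, -0.4],
          [-1.5, 0.1, 0.6]]
result = predict_with_fallback(scores)
print(result)
```

With a fitted pipeline, the scores would come from classifier.decision_function(X_test), and the resulting matrix can be passed to mlb.inverse_transform to recover tag names.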