I am doing multilabel classification, where I try to predict correct labels for each document and here is my code:
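from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Turn the per-document tag lists into a binary indicator matrix
mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)
classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
predicted = cross_val_predict(classifier, X, y)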
When running my code I get multiple warnings:
UserWarning: Label not :NUMBER: is present in all training examples.
When I print out the predicted and true labels, about half of all documents have empty label predictions.
Why is this happening? Is it related to the warnings printed during training? How can I avoid those empty predictions?
EDIT01: This also happens with estimators other than LinearSVC().
I've tried RandomForestClassifier() and it gives empty predictions as well. The strange thing is that when I use cross_val_predict(classifier, X, y, method='predict_proba') to predict probabilities for each label instead of binary 0/1 decisions, there is always at least one label per predicted set with probability > 0 for the given document. So why is this label not chosen by the binary decision? Or is the binary decision evaluated in a different way than the probabilities?
EDIT02: I have found an old post where the OP was dealing with a similar problem. Is this the same case?
The issue is likely that some tags occur in just a few documents (check out this thread for details). When you split the dataset into train and test to validate your model, it may happen that some tags are missing from the training data. Let train_indices be an array with the indices of the training samples. If a particular tag (of index k) does not occur in the training sample, all the elements in the k-th column of the indicator matrix y[train_indices] are zeros.
In the scenario described above the classifier will not be able to reliably predict the k-th tag in the test documents (more on this in the next paragraph). Therefore you cannot trust the predictions made by clf.predict and you need to implement the prediction function on your own, for example by using the decision values returned by clf.decision_function as suggested in this answer.
In datasets containing many labels, the occurrence frequency of most of them tends to be rather low. If these low values are fed to a binary classifier (i.e. a classifier that makes a 0-1 prediction), it is highly probable that the classifier will pick 0 for all tags on all documents.
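To check whether a given training split suffers from this, you can look for all-zero columns in y[train_indices]. The following is only a minimal sketch on mock data; the helper name missing_label_columns and the toy matrix are made up for illustration and are not part of the original code.

```python
import numpy as np

def missing_label_columns(y, train_indices):
    """Return the indices k of the label columns that are all zeros in y[train_indices]."""
    y_train = np.asarray(y)[train_indices]
    # A column without any positive entry means tag k never occurs in training.
    return np.flatnonzero(y_train.sum(axis=0) == 0)

# Mock indicator matrix: 5 documents, 3 tags (illustration only).
y = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
train_indices = np.array([0, 1, 2, 4])  # document 3 is held out for testing

missing = missing_label_columns(y, train_indices)
print(missing)
```

Here the only document carrying the last tag ends up in the test split, so that tag's column is empty in the training data.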
Yes, absolutely. That guy is facing exactly the same problem as you and his code is pretty similar to yours.
Demo
To further explain the issue I have put together a simple toy example using mock data.
Please notice that I have set min_df=1 since my dataset is much smaller than yours. When I run the following statement:
I get a bunch of warnings
and the following prediction:
Those rows whose entries are all 0 indicate that no tag is predicted for the corresponding document.
Workaround
For the sake of the analysis, let us validate the model manually rather than through cross_val_predict.
When the snippet above is executed, two warnings are issued (I used a context manager to make sure warnings are caught):
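A context manager for catching warnings could look roughly like the sketch below, built on the standard warnings module. The helper call_and_collect_warnings and the stand-in noisy_fit are hypothetical names; noisy_fit only mimics a fit call that warns about two missing labels, and in practice you would pass something like clf.fit with the training data instead.

```python
import warnings

def call_and_collect_warnings(func, *args, **kwargs):
    """Call func and return its result together with the warning messages it raised."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including repeated ones.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]

def noisy_fit():
    # Stand-in for a real fit call (assumption for illustration):
    # it emits one warning per label missing from the training data.
    for label in (2, 4):
        warnings.warn("Label not %d is present in all training examples." % label,
                      UserWarning)
    return "fitted"

result, messages = call_and_collect_warnings(noisy_fit)
for message in messages:
    print(message)
```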
This is consistent with the fact that the tags of indices 2 and 4 are missing from the training samples:
For some documents, the prediction is empty (those documents corresponding to the rows with all zeros in predicted_test):
To overcome that issue, you could implement your own prediction function like this:
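One possible shape for such a function is sketched here with plain NumPy. It assumes scores is the matrix of decision values you would get from clf.decision_function on the test documents; the function name predict_top_n and the mock scores are illustrative assumptions, not the exact code from the original demo.

```python
import numpy as np

def predict_top_n(scores, n_tag=1):
    """Mark, for every document (row), the n_tag labels with the highest score."""
    scores = np.asarray(scores, dtype=float)
    # Column indices of the n_tag largest scores in each row.
    top = np.argsort(scores, axis=1)[:, -n_tag:]
    predicted = np.zeros(scores.shape, dtype=int)
    np.put_along_axis(predicted, top, 1, axis=1)
    return predicted

# Mock decision values for 2 documents and 3 tags (illustration only).
# All scores in the first row are negative, which is the situation
# where thresholding each label separately yields an empty prediction.
scores = np.array([[-1.2, -0.3, -0.8],
                   [0.4, -0.9, 0.1]])

top1 = predict_top_n(scores, n_tag=1)
top2 = predict_top_n(scores, n_tag=2)
print(top1)
print(top2)
```

Because the labels are ranked per document rather than thresholded one by one, no row of the result can be all zeros as long as n_tag is at least 1.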
By doing so, each document is always assigned the n_tag tags with the highest confidence score:
I too had the same error. Then I used LabelEncoder() instead of MultiLabelBinarizer() to encode the labels.
I am not getting that error anymore.