I am trying to build a multi-label out-of-core text classifier. As described here, the idea is to read a (large-scale) text data set in batches and partially fit each batch to the classifiers. Additionally, when you have multi-label instances, as described here, the idea is to build as many binary classifiers as there are classes in the data set, in a One-Vs-All manner.
When combining the MultiLabelBinarizer and OneVsRestClassifier classes of sklearn with partial fitting, I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The code is the following:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b'],['a','b']]
mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)
clf = OneVsRestClassifier(MultinomialNB(alpha=0.01))
X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)
clf.partial_fit(X_train, Y_train, classes=categories)
You can imagine that the last three lines are applied to each minibatch; I have removed the minibatch-reading code for the sake of simplicity.
If you remove the OneVsRestClassifier and use MultinomialNB only, the code runs fine.
You are passing Y_train as transformed by MultiLabelBinarizer, i.e. in the form [[1, 1, 0], [0, 1, 0], [1, 1, 0]], but passing classes as ['a', 'b', 'c'], which is then passed through this line of code:
if np.setdiff1d(y, self.classes_):
    raise ValueError(("Mini-batch contains {0} while classes " +
                      "must be subset of {1}").format(np.unique(y),
                                                      self.classes_))
Here np.setdiff1d returns an array with more than one element, and if cannot reduce such an array to a single truth value, hence the error.
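The failure mode itself is easy to reproduce in isolation. A minimal sketch using plain NumPy, where the array simply stands in for a non-empty np.setdiff1d result:

```python
import numpy as np

diff = np.array([0, 1])  # stand-in for a non-empty np.setdiff1d result

try:
    if diff:  # NumPy cannot collapse a multi-element array to one bool
        print("mini-batch check triggered")
except ValueError as err:
    # "The truth value of an array with more than one element is ambiguous..."
    print(err)
```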
The first thing is that you should pass classes in the same numerical format as Y_train (e.g. classes=[0, 1, 2]).
Now even if you do that, the internal label_binarizer_ of OneVsRestClassifier will decide that the target is of type "multiclass" rather than multilabel, and will then refuse to transform the classes correctly. This, in my opinion, is a bug in OneVsRestClassifier and/or LabelBinarizer.
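You can see the type detection at work with scikit-learn's own type_of_target utility: the binarized rows are recognized as a multilabel target, while a flat list of class ids (which is what gets passed as classes) is not:

```python
from sklearn.utils.multiclass import type_of_target

# A binary indicator matrix is recognized as a multilabel target...
print(type_of_target([[1, 1, 0], [0, 1, 0], [1, 1, 0]]))  # multilabel-indicator

# ...but a flat list of three distinct class ids is seen as multiclass
print(type_of_target([0, 1, 2]))  # multiclass
```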
Please submit an issue about partial_fit to the scikit-learn GitHub and see what happens.
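Until that is resolved, a possible workaround (a sketch, not a drop-in replacement) is to run the one-vs-rest scheme by hand: keep one MultinomialNB per label column and update each with partial_fit using classes=[0, 1], which is a valid binary problem. Note that I use alternate_sign=False here, which replaced the deprecated non_negative=True in newer scikit-learn versions:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'], ['b'], ['a', 'b']]

mlb = MultiLabelBinarizer(classes=categories)
# alternate_sign=False keeps features non-negative (was non_negative=True)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
                               alternate_sign=False)

# one independent binary classifier per label column
classifiers = [MultinomialNB(alpha=0.01) for _ in categories]

# per mini-batch: vectorize, binarize, then update every binary problem
X_train = vectorizer.transform(X)
Y_train = mlb.fit_transform(Y)
for i, clf in enumerate(classifiers):
    # column i of Y_train is a plain binary target, so classes=[0, 1] is valid
    clf.partial_fit(X_train, Y_train[:, i], classes=[0, 1])

# recombine the per-label predictions into an indicator matrix
pred = np.array([clf.predict(X_train) for clf in classifiers]).T
print(pred.shape)
```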
Update
Apparently, deciding between "multilabel" and "multiclass" from the target vector (y) is a currently ongoing issue in scikit-learn because of all the complications surrounding it:
- https://github.com/scikit-learn/scikit-learn/issues/7665
- https://github.com/scikit-learn/scikit-learn/issues/5959
- https://github.com/scikit-learn/scikit-learn/issues/7931
- https://github.com/scikit-learn/scikit-learn/issues/8098
- https://github.com/scikit-learn/scikit-learn/issues/7628
- https://github.com/scikit-learn/scikit-learn/pull/2626
So this may be a different answer than you'd expect, but I would recommend not using OneVsRestClassifier and instead using the scikit-multilearn library, built on top of scikit-learn, which provides multi-label classifiers that are more state of the art than the simple OneVsRest.
You can find an example of how to use scikit-multilearn in the tutorial. A review of approaches to multi-label classification can be found in Tsoumakas's introduction to MLC.
But if it happens that your labels co-occur with each other, I would recommend a different classifier, for example Label Powerset with label space division using fast greedy community detection on the output space; I explain why this works in my paper about label space division.
Converting your code to use scikit-multilearn would make it look as follows:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset
categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b'],['a','b']]
mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)
X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)
# base single-label classifier
base_classifier = MultinomialNB(alpha=0.01)
# problem transformation from multi-label to single-label
transformation_classifier = LabelPowerset(base_classifier)
# clusterer dividing the label space using fast greedy modularity maximizing scheme
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True)
# ensemble
clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)
clf.fit(X_train, Y_train)