Multi-label out-of-core learning for text data: ValueError

Question:

I am trying to build a multi-label, out-of-core text classifier. As described here, the idea is to read (large-scale) text data sets in batches and partially fit them to the classifiers. Additionally, when you have multi-label instances, as described here, the idea is to build as many binary classifiers as there are classes in the data set, in a One-Vs-All manner.

When combining the MultiLabelBinarizer and OneVsRestClassifier classes of sklearn with partial fitting, I get the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The code is the following:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b'],['a','b']]

mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)
clf = OneVsRestClassifier(MultinomialNB(alpha=0.01))

X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)
clf.partial_fit(X_train, Y_train, classes=categories)

You can imagine that the last three lines are applied to each mini-batch; I have removed the batching loop for the sake of simplicity.

If you remove the OneVsRestClassifier and use MultinomialNB only, the code runs fine.

Answer 1:

You are passing Y_train as transformed by the MultiLabelBinarizer, which is in the form [[1, 1, 0], [0, 1, 0], [1, 1, 0]], but you are passing categories as ['a', 'b', 'c'], which is then passed through this line of the code:

if np.setdiff1d(y, self.classes_):
    raise ValueError(("Mini-batch contains {0} while classes " +
                      "must be subset of {1}").format(np.unique(y),
                                                      self.classes_))

Here np.setdiff1d returns an array with more than one element, and a Python if statement cannot evaluate such an array as a single truth value, hence the error.
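To see the failure mode in isolation, here is a minimal sketch (using simplified numeric stand-ins, not the exact arrays sklearn sees in partial_fit): np.setdiff1d returns a multi-element array, and feeding that directly to an if statement raises exactly this error.

import numpy as np

# set difference yields more than one element
diff = np.setdiff1d(np.array([0, 1, 2]), np.array([2, 3]))
print(diff)  # array([0, 1])

try:
    if diff:  # bool() on a multi-element array is ambiguous
        pass
except ValueError as err:
    print(err)  # "The truth value of an array with more than one element is ambiguous..."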

The first thing is that you should pass classes in the same numerical format as Y_train. But even if you do that, the internal label_binarizer_ of OneVsRestClassifier will decide that the target is of type "multiclass" rather than "multilabel" and will then refuse to transform the classes correctly. In my opinion this is a bug in OneVsRestClassifier and/or LabelBinarizer.
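As an illustration of that first suggestion (a sketch reusing the objects from the question above, not a verified fix), passing classes in the binarized column space would look roughly like this; as noted, it can still fail because of the target-type inference inside OneVsRestClassifier.

import numpy as np

# column indices of the binarized targets instead of the label strings
numeric_classes = np.arange(len(mlb.classes_))  # array([0, 1, 2]) for ['a', 'b', 'c']
clf.partial_fit(X_train, Y_train, classes=numeric_classes)
# may still raise, since the internal label_binarizer_ can infer a
# "multiclass" target type instead of "multilabel"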

Please submit an issue about partial_fit to the scikit-learn GitHub and see what happens.

Update: Apparently, deciding between "multilabel" and "multiclass" from the target vector (y) is a currently ongoing issue in scikit-learn because of all the complications surrounding it:

  • https://github.com/scikit-learn/scikit-learn/issues/7665
  • https://github.com/scikit-learn/scikit-learn/issues/5959
  • https://github.com/scikit-learn/scikit-learn/issues/7931
  • https://github.com/scikit-learn/scikit-learn/issues/8098
  • https://github.com/scikit-learn/scikit-learn/issues/7628
  • https://github.com/scikit-learn/scikit-learn/pull/2626
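The inference problem discussed in these issues is easy to reproduce with sklearn's own helper; a small sketch (the inputs mirror the binarized Y_train and the raw categories from the question):

import numpy as np
from sklearn.utils.multiclass import type_of_target

# the binarized targets are seen as multilabel, the raw class list as multiclass
print(type_of_target(np.array([[1, 1, 0], [0, 1, 0], [1, 1, 0]])))  # 'multilabel-indicator'
print(type_of_target(np.array(['a', 'b', 'c'])))                    # 'multiclass'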


Answer 2:

So maybe a different answer than you'd expect, but I would recommend that you don't use OneVsRestClassifier and instead use the scikit-multilearn library, built on top of scikit-learn, which provides multi-label classifiers that are more state of the art than the simple OneVsRest approach.

You can find an example of how to use scikit-multilearn in the tutorial. A review of approaches to multi-label classification can be found in Tsoumakas's introduction to MLC.

But if you have labels that co-occur with each other, I would recommend using a different classifier, for example Label Powerset with label space division using fast greedy community detection on the output space; I explain why this works in my paper about label space division.

Converting your code to use scikit-multilearn would make it look as follows:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset

categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b'],['a','b']]

mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)

X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)

# base single-label classifier 
base_classifier = MultinomialNB(alpha=0.01)

# problem transformation from multi-label to single-label 
transformation_classifier = LabelPowerset(base_classifier)

# clusterer dividing the label space using fast greedy modularity maximizing scheme
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True) 

# ensemble
clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)

clf.fit(X_train, Y_train)
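For completeness, a hedged usage sketch of the fitted model (an assumption about the surrounding objects, not part of the original answer): scikit-multilearn classifiers return a sparse label-indicator matrix from predict, which can be mapped back to label names with the MultiLabelBinarizer.

# predict labels for new, unseen text
X_new = vectorizer.transform(["Yet another test sentence"])
Y_pred = clf.predict(X_new)                     # sparse indicator matrix, columns aligned with mlb.classes_
print(mlb.inverse_transform(Y_pred.toarray()))  # e.g. [('a', 'b')]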