I'm working in a sentiment analysis problem the data looks like this:
label instances
5 1190
4 838
3 239
1 204
2 127
So my data is unbalanced since 1190 instances
are labeled with 5
. For the classification Im using scikit's SVC. The problem is I do not know how to balance my data in the right way in order to compute accurately the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:
First:
wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)
Second:
auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)
print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Recall:', recall_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, auto_weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)
Third:
clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)
from sklearn.metrics import precision_score, \
recall_score, confusion_matrix, classification_report, \
accuracy_score, f1_score
print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
0.930416613529
However, Im getting warnings like this:
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with
multiclass or multilabel data or pos_label=None will result in an
exception. Please set an explicit value for `average`, one of (None,
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for
instance, scoring="f1_weighted" instead of scoring="f1"
How can I deal correctly with my unbalanced data in order to compute in the right way classifier's metrics?
I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).
Class weights
The weights from the
class_weight
parameter are used to train the classifier. They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.Basically in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during the training, the classifier will make extra efforts to classify properly the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.
The metrics
Once you have a classifier, you want to know how well it is performing. Here you can use the metrics you mentioned:
accuracy
,recall_score
,f1_score
...Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.
I will not detail all these metrics but note that, with the exception of
accuracy
, they are naturally applied at the class level: as you can see in thisprint
of a classification report they are defined for each class. They rely on concepts such astrue positives
orfalse negative
that require defining which class is the positive one.The warning
You get this warning because you are using the f1-score, recall and precision without defining how they should be computed! The question could be rephrased: from the above classification report, how do you output one global number for the f1-score? You could:
avg / total
result above. It's also called macro averaging.'weighted'
in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class in the computation.These are 3 of the options in scikit-learn, the warning is there to say you have to pick one. So you have to specify an
average
argument for the score method.Which one you choose is up to how you want to measure the performance of the classifier: for instance macro-averaging does not take class imbalance into account and the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging however you'll get more importance for the class 5.
The whole argument specification in these metrics is not super-clear in scikit-learn right now, it will get better in version 0.18 according to the docs. They are removing some non-obvious standard behavior and they are issuing warnings so that developers notice it.
Computing scores
Last thing I want to mention (feel free to skip it if you're aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen. This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.
Here's a way to do it using
StratifiedShuffleSplit
, which gives you a random splits of your data (after shuffling) that preserve the label distribution.Hope this helps.
Posed question
Responding to the question 'what metric should be used for multi-class classification with imbalanced data': Macro-F1-measure. Macro Precision and Macro Recall can be also used, but they are not so easily interpretable as for binary classificaion, they are already incorporated into F-measure, and excess metrics complicate methods comparison, parameters tuning, and so on.
Micro averaging are sensitive to class imbalance: if your method, for example, works good for the most common labels and totally messes others, micro-averaged metrics show good results.
Weighting averaging isn't well suited for imbalanced data, because it weights by counts of labels. Moreover, it is too hardly interpretable and unpopular: for instance, there is no mention of such an averaging in the following very detailed survey I strongly recommend to look through:
Application-specific question
However, returning to your task, I'd research 2 topics:
Commonly used metrics. As I can infer after looking through literature, there are 2 main evaluation metrics:
(link) - note that the authors work with almost the same distribution of ratings, see Figure 5.
(link)
(link) - they explore both accuracy and MSE, considering the latter to be better
(link) - they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write a letter to the authors, the work is pretty new and seems to be written in Python.
Cost of different errors. If you care more about avoiding gross blunders, e.g. assinging 1-star to 5-star review or something like that, look at MSE; if difference matters, but not so much, try MAE, since it doesn't square diff; otherwise stay with Accuracy.
About approaches, not metrics
Try regression approaches, e.g. SVR, since they generally outperforms Multiclass classifiers like SVC or OVA SVM.
First of all it's a little bit harder using just counting analysis to tell if your data is unbalanced or not. For example: 1 in 1000 positive observation is just a noise, error or a breakthrough in science? You never know.
So it's always better to use all your available knowledge and choice its status with all wise.
Okay, what if it's really unbalanced?
Once again — look to your data. Sometimes you can find one or two observation multiplied by hundred times. Sometimes it's useful to create this fake one-class-observations.
If all the data is clean next step is to use class weights in prediction model.
So what about multiclass metrics?
In my experience none of your metrics is usually used. There are two main reasons.
First: it's always better to work with probabilities than with solid prediction (because how else could you separate models with 0.9 and 0.6 prediction if they both give you the same class?)
And second: it's much easier to compare your prediction models and build new ones depending on only one good metric.
From my experience I could recommend logloss or MSE (or just mean squared error).
How to fix sklearn warnings?
Just simply (as yangjie noticed) overwrite
average
parameter with one of these values:'micro'
(calculate metrics globally),'macro'
(calculate metrics for each label) or'weighted'
(same as macro but with auto weights).All your Warnings came after calling metrics functions with default
average
value'binary'
which is inappropriate for multiclass prediction.Good luck and have fun with machine learning!
Edit:
I found another answerer recommendation to switch to regression approaches (e.g. SVR) with which I cannot agree. As far as I remember there is no even such a thing as multiclass regression. Yes there is multilabel regression which is far different and yes it's possible in some cases switch between regression and classification (if classes somehow sorted) but it pretty rare.
What I would recommend (in scope of scikit-learn) is to try another very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.
After that you can calculate arithmetic or geometric mean between predictions and most of the time you'll get even better result.
Lot of very detailed answers here but I don't think you are answering the right questions. As I understand the question, there are two concerns:
1.
You can use most of the scoring functions in scikit-learn with both multiclass problem as with single class problems. Ex.:
This way you end up with tangible and interpretable numbers for each of the classes.
Then...
2.
... you can tell if the unbalanced data is even a problem. If the scoring for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5) then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread. However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence, the unbalance is a good thing.