I am working on image classification using Gaussian Mixture Models. I have around 34,000 features, belonging to three classes, all lying in a 23-dimensional space. I performed feature scaling on both the training and testing data using different methods, and I observed that accuracy actually drops after scaling. I scaled the features because many of them differ by several orders of magnitude. I am curious to know why this is happening. I thought that feature scaling would increase the accuracy, especially given the large differences in features.
Question:
Answer 1:
> I thought that feature scaling would increase the accuracy, especially given the large differences in features.
Welcome to the real world, buddy.
In general, it is quite true that you want features to be on the same "scale" so that you don't have some features "dominating" other features. This is especially so if your machine learning algorithm is inherently "geometrical" in nature. By "geometrical", I mean it treats the samples as points in a space and relies on distances (usually Euclidean/L2, as in your case) between points in making its predictions, i.e., the spatial relationships of the points matter. GMM and SVM are algorithms of this nature.
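To make the "dominating" point concrete, here is a minimal sketch using only the Python standard library. The data (income in dollars, age in years) and all function names are made up for illustration. It shows the nearest neighbour of a query point changing once each feature is z-scored with statistics taken from the samples.

```python
import math
from statistics import mean, pstdev

# Hypothetical toy data: (income in dollars, age in years).
samples = [
    (50_000.0, 25.0),
    (51_000.0, 60.0),
    (58_000.0, 26.0),
]
query = (50_900.0, 27.0)

def euclidean(a, b):
    """Plain L2 distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def column_stats(rows):
    """Per-feature (mean, population std) computed from the rows."""
    return [(mean(col), pstdev(col)) for col in zip(*rows)]

def zscore(row, stats):
    """Standardize one row with previously computed statistics."""
    return tuple((v - m) / s for v, (m, s) in zip(row, stats))

def nearest(rows, q):
    """Index of the row closest to q in L2 distance."""
    return min(range(len(rows)), key=lambda i: euclidean(rows[i], q))

# Raw features: the income axis dwarfs the age axis.
raw_nearest = nearest(samples, query)

# Standardized features: both axes contribute comparably.
stats = column_stats(samples)
scaled_samples = [zscore(r, stats) for r in samples]
scaled_nearest = nearest(scaled_samples, zscore(query, stats))

print(raw_nearest, scaled_nearest)
```

On the raw data the 60-year-old is picked because the incomes are close. After standardizing, the 25-year-old is picked. Which answer is "right" depends on your problem. The point is only that the scale decides it.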
However, feature scaling can screw things up, especially if some features are categorical/ordinal in nature and you didn't properly preprocess them when you appended them to the rest of your features. Furthermore, depending on your feature scaling method, the presence of outliers in a particular feature can also screw up the scaling for that feature. For example, "min/max" or "unit variance" scaling is going to be sensitive to outliers (e.g., if one of your features encodes yearly income or cash balance and there are a few millionaires or billionaires in your dataset).
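A rough sketch of the outlier problem, with invented income values. It compares min/max scaling against a median/interquartile-range scaling, which is one common robust alternative and not something the question mentions. The helper names are hypothetical.

```python
from statistics import median, quantiles

# Hypothetical yearly incomes: five ordinary values and one billionaire.
incomes = [30_000, 40_000, 50_000, 60_000, 70_000, 1_000_000_000]

def min_max(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def robust(values):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

mm = min_max(incomes)
rb = robust(incomes)

# How far apart the five ordinary incomes end up after each scaling.
mm_spread = max(mm[:5]) - min(mm[:5])
rb_spread = max(rb[:5]) - min(rb[:5])

print(mm_spread, rb_spread)
```

With min/max scaling the five ordinary incomes are squashed into a sliver near 0, so the feature carries almost no information for them. The median/IQR version keeps them spread out.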
Also, when you experience a problem such as this, the cause may not be obvious. Just because you performed feature scaling and the result got worse does not mean feature scaling is at fault. It could be that your method was screwed up to begin with, and the result after feature scaling just happens to be more screwed up.
So what could be other cause(s) of your problem?
- My guess for the most likely cause is that you have high-dimensional data and not enough training samples. This is because your GMM is going to be estimating covariance matrices using data that is 34000 in dimension. Unless you have a lot of data, chances are one or more of your covariance matrices (one for each Gaussian) are going to be singular or near singular. This means the predictions from your GMM are nonsense to begin with because your Gaussians "blew up", and/or the EM algorithm just gave up after a predefined number of iterations.
- Poor testing methodology. You did not have your data divided into proper training/validation/test sets, and you did not perform the testing properly. Whatever "good" performance you had in the beginning was not credible. This is actually very common, as the natural tendency is to test using the training data the model was fitted on rather than a validation or test set.
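The singular-covariance point from the first bullet can be checked on a tiny example. This is only a sketch with made-up 3-dimensional points and hand-rolled helpers (`covariance_matrix`, `determinant`), not what any GMM library does internally. With as many samples as dimensions the sample covariance has determinant zero, and adding more samples makes it invertible.

```python
def covariance_matrix(rows):
    """Sample covariance (divides by n - 1) of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # treat a vanishing pivot as exactly singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det

# Hypothetical data: 3 samples in 3 dimensions, then 6 samples.
few = [(1.0, 2.0, 0.5), (2.0, 1.0, 1.5), (3.0, 4.0, 2.0)]
many = few + [(0.0, 1.0, 3.0), (4.0, 0.5, 1.0), (2.5, 3.0, 0.0)]

det_few = determinant(covariance_matrix(few))
det_many = determinant(covariance_matrix(many))

print(det_few, det_many)
```

Centering removes one degree of freedom, so n samples can span at most n - 1 directions. A Gaussian density needs the inverse of this matrix, which does not exist when the determinant is zero.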
So what can you do?
- Don't use a GMM for image categorization. Use a proper supervised learning algorithm, especially if you have known image categories as labels. In particular, to avoid the feature scaling altogether, use random forest or its variants (e.g., extremely randomized trees).
- Get more training data. Unless you are classifying "simple" images (i.e., "toy"/synthetic images) or you are classifying them into a few image classes (e.g., <= 5; note this is just a random small number I pulled out of the air), you really need to have a good number of images per class. A good starting point is to get at least a couple of hundred per class, or use a more sophisticated algorithm that exploits the structure within your data to arrive at better performance.
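On why tree-based methods let you skip feature scaling: a split only asks whether a feature is above a threshold, so any order-preserving rescaling leaves the resulting partition unchanged. Below is a minimal sketch with a single decision stump, which is a stand-in I chose for illustration and not a full random forest. The data and function names are hypothetical.

```python
def best_stump(xs, ys):
    """Return (threshold, accuracy) of the best rule 'predict 1 if x > t'."""
    ordered = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for t in candidates:
        preds = [1 if x > t else 0 for x in xs]
        acc = sum(p == y for p, y in zip(preds, ys)) / len(ys)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best

def predict(xs, threshold):
    """Apply the stump rule to a list of feature values."""
    return [1 if x > threshold else 0 for x in xs]

# Hypothetical single feature spanning several orders of magnitude.
raw = [1200.0, 1500.0, 90_000.0, 250_000.0, 3_000_000.0, 800.0]
labels = [0, 0, 1, 1, 1, 0]

# Min/max scaling preserves the ordering of the values.
lo, hi = min(raw), max(raw)
scaled = [(x - lo) / (hi - lo) for x in raw]

t_raw, acc_raw = best_stump(raw, labels)
t_scaled, acc_scaled = best_stump(scaled, labels)
same_predictions = predict(raw, t_raw) == predict(scaled, t_scaled)

print(acc_raw, acc_scaled, same_predictions)
```

The thresholds differ numerically, but the stump separates the samples identically before and after scaling.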
Basically, my point is don't (just) treat machine learning algorithms as black boxes and a bunch of tricks which you memorize and try at random. Try to understand the algorithm/math under the hood. That way, you'll be better able to diagnose the problem(s) you encounter.
EDIT (in response to request for clarification by @Zee)
For papers, the only one I can recall off the top of my head is A Practical Guide to Support Vector Classification by the authors of LibSVM. Examples therein show the importance of feature scaling for SVM on various datasets. E.g., consider the RBF/Gaussian kernel. This kernel uses the squared L2 norm. If your features are on different scales, this will affect the kernel value.
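A small sketch of that effect, assuming the usual form of the RBF kernel, exp(-gamma * ||x - y||^2). The samples, the `gamma` values, and the per-feature scale factors are all invented for illustration.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF/Gaussian kernel: exp(-gamma * squared L2 distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical samples: (height in meters, weight in grams).
a = (1.60, 70_000.0)
b = (1.95, 70_500.0)  # very different height, similar weight
c = (1.61, 74_000.0)  # similar height, different weight

# Raw features: the weight axis decides everything.
k_ab = rbf_kernel(a, b, gamma=1e-6)
k_ac = rbf_kernel(a, c, gamma=1e-6)
k_ab_weight_only = rbf_kernel(a[1:], b[1:], gamma=1e-6)

# Rescale each feature by an assumed typical spread (0.1 m, 5000 g).
scales = (0.1, 5_000.0)

def rescale(p):
    return tuple(v / s for v, s in zip(p, scales))

ks_ab = rbf_kernel(rescale(a), rescale(b), gamma=0.5)
ks_ac = rbf_kernel(rescale(a), rescale(c), gamma=0.5)

print(k_ab, k_ac, ks_ab, ks_ac)
```

On the raw features, dropping height altogether barely changes the kernel value, and `a` looks more similar to `b` than to `c`. After rescaling the ranking flips, because height now counts.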
Also, how you represent your features matters. E.g., changing a variable that represents height from meters to cm or inches will affect algorithms such as PCA (because the variance along the direction for that feature has changed). Note this is different from the "typical" scaling (e.g., min/max, Z-score, etc.) in that this is a matter of representation. The person is still the same height regardless of the unit, whereas typical feature scaling "transforms" the data, which changes the "height" of the person. Prof. David MacKay, on the Amazon page of his book, Information Theory, Inference, and Learning Algorithms, has a comment in this vein when asked why he did not include PCA in his book.
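To see the unit effect on PCA numerically, here is a sketch with made-up heights and weights. For two features the first principal axis has a closed form, theta = 0.5 * atan2(2 * s_xy, s_xx - s_yy), which is what the hypothetical `principal_angle` helper below computes.

```python
import math
from statistics import mean, pvariance

# Hypothetical measurements of five people.
heights_m = [1.50, 1.62, 1.75, 1.80, 1.93]
weights_kg = [52.0, 60.0, 71.0, 77.0, 90.0]
heights_cm = [h * 100 for h in heights_m]  # same people, different unit

def principal_angle(xs, ys):
    """Angle in radians of the first principal axis, measured from the x axis."""
    mx, my = mean(xs), mean(ys)
    sxx = pvariance(xs)
    syy = pvariance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return 0.5 * math.atan2(2 * sxy, sxx - syy)

# Changing the unit multiplies the variance by 100 squared.
variance_ratio = pvariance(heights_cm) / pvariance(heights_m)

angle_m = math.degrees(principal_angle(heights_m, weights_kg))
angle_cm = math.degrees(principal_angle(heights_cm, weights_kg))

print(variance_ratio, angle_m, angle_cm)
```

With height in meters, the first principal axis points almost entirely along weight. With height in centimeters, it swings to roughly the diagonal, even though nothing about the people changed.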
Ordinal and categorical variables are mentioned briefly in Bayesian Reasoning and Machine Learning and in The Elements of Statistical Learning. They describe ways to encode them as features, e.g., replacing a variable that can take 3 categories with 3 binary variables, with one set to "1" to indicate the sample has that category. This is important for methods such as Linear Regression (or Linear Classifiers). Note this is about encoding categorical variables/features, not scaling per se, but it is part of the feature preprocessing setup, and hence useful to know. More can be found in Hal Daumé III's book below.
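The 3-binary-variables encoding described above looks like this as a sketch. The category names and the `one_hot` helper are made up for illustration.

```python
def one_hot(value, categories):
    """Encode value as a list of 0/1 flags, one flag per known category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

# Hypothetical categorical feature with 3 categories, plus one numeric feature.
categories = ["red", "green", "blue"]
rows = [("red", 0.3), ("blue", 1.2)]

# Replace the category with 3 binary features and keep the numeric one.
encoded = [one_hot(color, categories) + [x] for color, x in rows]

print(encoded)
```

Coding the categories as 0, 1, 2 instead would tell a distance-based method that "red" is twice as far from "blue" as from "green", which is usually meaningless.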
The book A Course in Machine Learning by Hal Daumé III. Search for "scaling". One of the earliest examples in the book is how it affects KNN (which just uses L2 distance, as do GMM and, if you use the RBF/Gaussian kernel, SVM). More details are given in chapter 4, "Machine Learning in Practice". Unfortunately the images/plots are not shown in the PDF. This book has one of the nicest treatments of feature encoding and scaling, especially if you work on Natural Language Processing (NLP). E.g., see his explanation of applying the logarithm to features (i.e., the log transform). That way, a sum of logs becomes the log of the product of the features, and the "effects"/"contributions" of these features are tapered by the logarithm.
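Both properties of the log transform mentioned above are easy to check. This sketch uses invented feature values.

```python
import math

# Hypothetical positive feature values, e.g. word counts.
counts = [3.0, 40.0, 500.0]

# A sum of logs equals the log of the product.
sum_of_logs = sum(math.log(c) for c in counts)
log_of_product = math.log(math.prod(counts))

# Tapering: a huge raw gap becomes a modest gap after the log.
raw_gap = 1_000_000 - 1_000
log_gap = math.log(1_000_000) - math.log(1_000)

print(sum_of_logs, log_of_product, raw_gap, log_gap)
```

The raw gap of 999,000 shrinks to log(1000), which is about 6.9, so a single huge value no longer swamps the others. Note the transform needs strictly positive inputs. A common workaround for counts that can be zero is log(1 + x).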
Note that all the aforementioned textbooks are freely downloadable from the above links.