I am working on image classification using Gaussian Mixture Models. I have around 34,000 feature vectors, belonging to three classes, all lying in a 23-dimensional space. I applied feature scaling to both the training and testing data using several different methods, and I observed that accuracy actually drops after scaling. I scaled the features because many of them differ by several orders of magnitude. I am curious to know why this is happening. I thought that feature scaling would increase the accuracy, especially given the large differences between features.
Welcome to the real world, buddy.
In general, it is quite true that you want features to be on the same "scale" so that you don't have some features "dominating" other features. This is especially so if your machine learning algorithm is inherently "geometrical" in nature. By "geometrical", I mean it treats the samples as points in a space and relies on distances between points (usually Euclidean/L2, as in your case) in making its predictions, i.e., the spatial relationships of the points matter. GMM and SVM are algorithms of this nature.
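To make the "dominating" point concrete, here is a minimal sketch in plain Python. The three samples and the per-feature ranges are made up for illustration; they are not from your data.

```python
import math

def euclidean(a, b):
    """Plain L2 distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical samples: (height in metres, yearly income in dollars).
a = (1.60, 50_000.0)
b = (1.95, 50_100.0)  # very different height, almost the same income
c = (1.61, 52_000.0)  # almost the same height, somewhat different income

# Raw features: income differences swamp height differences.
d_ab_raw = euclidean(a, b)
d_ac_raw = euclidean(a, c)

# Assumed per-feature ranges used for rescaling (illustrative only).
ranges = (0.5, 100_000.0)

def rescale(sample):
    """Divide each feature by its assumed range."""
    return tuple(v / r for v, r in zip(sample, ranges))

d_ab_scaled = euclidean(rescale(a), rescale(b))
d_ac_scaled = euclidean(rescale(a), rescale(c))

print(d_ab_raw < d_ac_raw)        # on raw data, b looks nearer to a than c does
print(d_ab_scaled > d_ac_scaled)  # with comparable scales, c is nearer to a
```

On the raw data the income column decides who the nearest neighbour is; after rescaling, height gets a say too. Whether that helps or hurts depends on which feature actually carries the class information.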
However, feature scaling can screw things up, especially if some features are categorical/ordinal in nature and you didn't properly preprocess them when you appended them to the rest of your features. Furthermore, depending on your feature scaling method, the presence of outliers in a particular feature can also screw up the scaling for that feature. E.g., "min/max" or "unit variance" scaling is sensitive to outliers (say, one of your features encodes yearly income or cash balance and there are a few millionaires or billionaires in your dataset).
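A small sketch of the outlier problem with min/max scaling, in plain Python. The income figures are invented for illustration.

```python
def min_max_scale(values):
    """Map values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical yearly incomes.
incomes = [30_000, 45_000, 60_000, 75_000, 90_000]
scaled = min_max_scale(incomes)

# Same data plus a single billionaire.
with_outlier = incomes + [1_000_000_000]
scaled_outlier = min_max_scale(with_outlier)

print(scaled)               # evenly spread over [0, 1]
print(scaled_outlier[:5])   # the original five are squashed next to 0
```

With the outlier present, the five ordinary incomes all land within about 0.0001 of each other, so the feature effectively stops distinguishing them.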
Also, when you experience a problem such as this, the cause may not be obvious. Just because you applied feature scaling and the result got worse does not mean feature scaling is at fault. It could be that your method was screwed up to begin with, and the result after feature scaling just happens to be more screwed up.
So what could be other cause(s) of your problem?
So what can you do?
Basically, my point is: don't (just) treat the machine learning field and its algorithms as black boxes and a bunch of tricks which you memorize and try at random. Try to understand the algorithm/math under the hood. That way, you'll be better able to diagnose the problem(s) you encounter.
EDIT (in response to request for clarification by @Zee)
For papers, the only one I can recall off the top of my head is A Practical Guide to Support Vector Classification by the authors of LibSVM. Examples therein show the importance of feature scaling for SVM on various datasets. E.g., consider the RBF/Gaussian kernel. This kernel uses the squared L2 norm of the difference between two samples. If your features are on different scales, the features with the larger scales will dominate the kernel value.
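As a rough illustration of the RBF point, here is a sketch of the kernel k(x, y) = exp(-gamma * ||x - y||^2) in plain Python. The samples, the choice gamma = 1.0, and the rescaling constant are assumptions for the example.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF/Gaussian kernel: exp(-gamma * squared L2 distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical samples: (height in metres, yearly income in dollars).
x = (1.60, 50_000.0)
y = (1.62, 50_010.0)

# Raw: a 10-dollar income gap contributes 100 to the squared distance.
k_raw = rbf_kernel(x, y)

# Income divided by an assumed scale of 100,000 (illustrative only).
x_scaled = (1.60, 50_000.0 / 100_000.0)
y_scaled = (1.62, 50_010.0 / 100_000.0)
k_scaled = rbf_kernel(x_scaled, y_scaled)

print(k_raw)     # essentially zero: the kernel sees the samples as unrelated
print(k_scaled)  # close to one: the kernel sees them as near-identical
```

The two people are practically the same in both cases; only the units changed, yet the kernel's verdict flips.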
Also, how you represent your features matters. E.g., changing a variable that represents height from meters to cm or inches will affect algorithms such as PCA (because the variance along the direction for that feature has changed). Note this is different from the "typical" scaling (e.g., min/max, Z-score etc.) in that this is a matter of representation. The person is still the same height regardless of the unit, whereas typical feature scaling "transforms" the data, which changes the "height" of the person. Prof. David MacKay, on the Amazon page of his book, Information Theory, Inference, and Learning Algorithms, has a comment in this vein when asked why he did not include PCA in his book.
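The unit-change effect on PCA can be sketched for two features without any library, using the closed-form major-axis angle of a 2x2 covariance matrix. The height/weight numbers are invented for illustration.

```python
import math

def first_pc_2d(xs, ys):
    """Unit vector of the first principal component for two features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Major-axis angle of the covariance matrix [[var_x, cov], [cov, var_y]].
    theta = 0.5 * math.atan2(2 * cov, var_x - var_y)
    return (math.cos(theta), math.sin(theta))

# Hypothetical people: heights and weights.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [50.0, 58.0, 65.0, 74.0, 80.0]
heights_cm = [h * 100 for h in heights_m]  # same people, different unit

pc_m = first_pc_2d(heights_m, weights_kg)
pc_cm = first_pc_2d(heights_cm, weights_kg)

print(pc_m)   # (height loading, weight loading) with height in metres
print(pc_cm)  # same, with height in centimetres
```

With height in metres, the first component points almost entirely along the weight axis; with height in centimetres, height gets the larger loading. Nothing about the people changed.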
Ordinal and categorical variables are mentioned briefly in Bayesian Reasoning and Machine Learning and in The Elements of Statistical Learning. They mention ways to encode them as features, e.g., replacing a variable that can represent 3 categories with 3 binary variables, with one set to "1" to indicate the sample has that category. This is important for methods such as Linear Regression (or Linear Classifiers). Note this is about encoding categorical variables/features, not scaling per se, but it is part of the feature preprocessing setup, and hence useful to know. More can be found in Hal Daumé III's book below.
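The 3-categories-to-3-binary-variables encoding described above can be sketched as follows. The helper name `one_hot` and the colour categories are made up for the example.

```python
def one_hot(value, categories):
    """Return one binary indicator per category, with a 1 for the match."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

# Hypothetical categorical feature with three possible values.
categories = ["red", "green", "blue"]

encoded = [one_hot(v, categories) for v in ["green", "blue", "red"]]
print(encoded)
```

Encoding the same variable as a single integer (0, 1, 2) would instead tell a distance-based method that "red" is closer to "green" than to "blue", which is usually not what you mean.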
The book A Course in Machine Learning by Hal Daumé III. Search for "scaling". One of the earliest examples in the book is how it affects KNN (which just uses L2 distance, as do GMM and SVM with the RBF/Gaussian kernel). More details are given in chapter 4, "Machine Learning in Practice". Unfortunately the images/plots are not shown in the PDF. This book has one of the nicest treatments of feature encoding and scaling, especially if you work on Natural Language Processing (NLP). E.g., see his explanation of applying the logarithm to features (i.e., the log transform). That way, a sum of logs becomes the log of the product of the features, and the "effects"/"contributions" of these features are tapered by the logarithm.
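A quick sketch of the two properties of the log transform mentioned above, using made-up count features (e.g., word counts, all at least 1 so the logarithm is defined):

```python
import math

# Hypothetical count features spanning several orders of magnitude.
counts = [1, 10, 100, 1000]
logged = [math.log(c) for c in counts]

# Sum of logs equals the log of the product.
sum_of_logs = sum(logged)
log_of_product = math.log(math.prod(counts))

# The spread is tapered: compare the range before and after.
raw_range = max(counts) - min(counts)
log_range = max(logged) - min(logged)

print(math.isclose(sum_of_logs, log_of_product))
print(raw_range, log_range)
```

The raw counts span a range of 999, while the logged values span less than 7, so a single very large count can no longer swamp the others.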
Note that all the aforementioned textbooks are freely downloadable from the above links.