I performed logistic regression on a binary classification problem with data of dimensions 50000 x 370. I got an accuracy of about 90%. But when I did PCA + logistic regression on the data, my accuracy dropped to 10%. I was very shocked to see this result. Can anybody explain what could have gone wrong?
Answer 1:
There is no guarantee that PCA will help, or that it will not harm, the learning process. In particular, if you use PCA to reduce the number of dimensions, you are removing information from your data, so anything can happen: if the removed data was redundant, you will probably get better scores; if it was an important part of the problem, you will get worse ones. Even without dropping dimensions, merely "rotating" the input space through PCA can both benefit and harm the process. One must remember that PCA is just a heuristic when it comes to supervised learning.

The only guarantees of PCA are that each consecutive dimension explains less and less variance, and that it is the best affine transformation in terms of explaining variance in the first K dimensions. That's all. This can be completely unrelated to the actual problem, as PCA does not consider labels at all. Given any dataset, PCA will transform it in a way that depends only on the positions of the points, so for some labelings (those consistent with the general shape of the data) it might help, but for many others (more complex patterns of labels) it will destroy previously detectable relations. Furthermore, as PCA changes the scaling of the features, you might need different hyperparameters for your classifier, such as the regularization strength for logistic regression.
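To make this concrete, here is a small synthetic sketch (the data is entirely made up for illustration) where the label-carrying direction has low variance, so PCA to one component throws it away and logistic regression falls back to chance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 2000
y = rng.randint(0, 2, n)
# Feature 0: high-variance pure noise (std 10).
# Feature 1: low-variance, but carries the label almost perfectly.
X = np.column_stack([rng.normal(0.0, 10.0, n),
                     y + rng.normal(0.0, 0.1, n)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Logistic regression on the raw features: near-perfect accuracy.
acc_raw = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# PCA keeps the direction of maximal variance -- the noise feature --
# because it never looks at y. The informative feature is discarded.
pca = PCA(n_components=1).fit(X_tr)
clf = LogisticRegression().fit(pca.transform(X_tr), y_tr)
acc_pca = clf.score(pca.transform(X_te), y_te)

print(acc_raw)  # close to 1.0
print(acc_pca)  # close to 0.5 (chance level)
```

Note that accuracy after PCA degrades to roughly 50%, not below it; PCA can only remove signal, not systematically invert it.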
Now getting back to your problem: I would say that in your case the issue is... a bug in your code. In binary classification you cannot meaningfully drop significantly below 50% accuracy; 10% accuracy means that taking the opposite of your classifier's predictions would give 90% (just answering "false" when it says "true" and vice versa). So even though PCA might not help (or might even harm, as described), in your case it is almost certainly an error in your code.
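The "flip the predictions" argument can be checked in a couple of lines (the toy predictions below are made up to be wrong on 9 of 10 examples):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
# A classifier that is correct only on the first example:
y_pred = np.where(np.arange(10) == 0, y_true, 1 - y_true)

acc = (y_pred == y_true).mean()                  # 0.1
flipped_acc = ((1 - y_pred) == y_true).mean()    # 0.9

# For binary labels the two accuracies always sum to 1,
# so any classifier below 50% can be trivially inverted.
print(acc, flipped_acc)
```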