Is it good to normalize/standardize data that is mostly zeros?

Posted 2020-06-28 12:52

I have data with around 60 features, and most of them are zero most of the time; in my training data only 2-3 columns actually have values (to be precise, it's performance log data). However, my test data has values in some other columns as well.

I've done normalization/standardization (tried both separately) and fed the result to PCA/SVD (tried both separately). I used these features to fit my model, but it gives very inaccurate results.

Whereas if I skip the normalization/standardization step and feed my data directly to PCA/SVD and then to the model, it gives accurate results (around 90% accuracy or above).

P.S.: I have to do anomaly detection, so I'm using the Isolation Forest algorithm.

Why do these results vary so much?
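
For concreteness, here is a minimal sketch of the two pipelines being compared, assuming scikit-learn; the random data, component count, and contamination value below are placeholders rather than my actual setup:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Stand-in for the sparse perf-log data: 60 mostly-zero columns.
X_train = rng.poisson(0.1, size=(500, 60)).astype(float)

# Pipeline A: standardize -> SVD -> Isolation Forest (the "inaccurate" one).
pipe_scaled = make_pipeline(
    StandardScaler(),
    TruncatedSVD(n_components=10, random_state=0),
    IsolationForest(contamination=0.05, random_state=0),
)

# Pipeline B: SVD on the raw features -> Isolation Forest (the "accurate" one).
pipe_raw = make_pipeline(
    TruncatedSVD(n_components=10, random_state=0),
    IsolationForest(contamination=0.05, random_state=0),
)

pipe_scaled.fit(X_train)
pipe_raw.fit(X_train)

# predict() returns +1 for inliers and -1 for detected anomalies.
print(pipe_scaled.predict(X_train[:5]))
print(pipe_raw.predict(X_train[:5]))
```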

2 Answers
家丑人穷心不美
#2 · 2020-06-28 13:23

Normalization and standardization (depending on the source they are sometimes used interchangeably, so I'm not sure exactly what you mean by each one here, but it's not important) are a general recommendation that usually works well in problems where the data is more or less homogeneously distributed. Anomaly detection, however, is by definition not that kind of problem. If you have a data set where most of the examples belong to class A and only a few belong to class B, it is quite possible that sparse features (features that are almost always zero) are actually very discriminative for your problem. Normalizing them will basically turn them to zero or almost zero, making it hard for a classifier (or PCA/SVD) to grasp their importance. So it is not unreasonable that you get better accuracy if you skip the normalization, and you shouldn't feel you are doing it "wrong" just because you are "supposed to do it".
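
As a rough illustration of what I mean (a toy example, not your data): a sparse column whose rare spikes tower over everything else in raw units ends up on the same unit scale as a dense noisy column after standardization, so variance-based methods like PCA/SVD no longer single it out.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# A dense, uninformative feature: ordinary noise with std ~5.
dense = rng.normal(0.0, 5.0, size=n)

# A sparse, potentially very discriminative feature:
# zero about 95% of the time, a spike of 100 otherwise.
sparse = np.where(rng.random(n) < 0.05, 100.0, 0.0)

X = np.column_stack([dense, sparse])
X_std = StandardScaler().fit_transform(X)

# Raw data: the sparse spikes are ~20x larger than the dense noise.
# Standardized data: both columns have unit variance, and the spikes are
# only a few standard deviations tall, so PCA/SVD will weight them far less.
print("raw column stds:   ", X.std(axis=0))
print("scaled column stds:", X_std.std(axis=0))
print("largest |value| per column after scaling:", np.abs(X_std).max(axis=0))
```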

I don't have experience with anomaly detection, but I do have some with imbalanced data sets. You could consider some form of "weighted normalization", where the computation of the mean and variance of each feature is weighted with a value inversely proportional to the number of examples in the class (e.g. examples_A ^ alpha / (examples_A ^ alpha + examples_B ^ alpha), with alpha some small negative number). If your sparse features have very different scales (e.g. one is 0 in 90% of cases and 3 in the other 10%, while another is 0 in 90% of cases and 80 in the other 10%), you could simply scale them to a common range (e.g. [0, 1]).
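
A minimal sketch of what I have in mind for the weighted variant (the function name and the exact weighting here are my own invention, just to show the idea):

```python
import numpy as np

def class_weighted_standardize(X, y, alpha=-0.5):
    """Standardize features using class-weighted mean and variance.

    Each class gets weight n_class ** alpha (normalized across classes),
    so with a negative alpha the minority class counts for more, as in the
    examples_A ** alpha / (examples_A ** alpha + examples_B ** alpha) idea above.
    """
    classes, counts = np.unique(y, return_counts=True)
    class_weight = counts.astype(float) ** alpha
    class_weight /= class_weight.sum()

    # Per-sample weight: the class weight spread evenly over its members.
    sample_w = np.zeros(len(y), dtype=float)
    for c, w, n in zip(classes, class_weight, counts):
        sample_w[y == c] = w / n
    sample_w /= sample_w.sum()

    mean = np.average(X, axis=0, weights=sample_w)
    var = np.average((X - mean) ** 2, axis=0, weights=sample_w)
    return (X - mean) / np.sqrt(var + 1e-12)

# Toy usage: 95 "normal" rows (class 0) and 5 "anomalous" rows (class 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)
X_scaled = class_weighted_standardize(X, y)
```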

In any case, as I said, do not apply techniques just because they are supposed to work. If something doesn't work for your problem or your particular dataset, you have every right not to use it (and trying to understand why it doesn't work may yield some useful insights).

戒情不戒烟
#3 · 2020-06-28 13:32

Any features that only have zeros (or any other constant value) in the training set are not, and cannot be, useful for any ML model. You should discard them. The model cannot learn any information from them, so it won't matter that the test data has some non-zero values in those columns.
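
For example, with scikit-learn you can drop such columns automatically (a minimal sketch; VarianceThreshold with threshold 0 removes features that are constant in the training set):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_train = np.array([
    [0.0, 1.2, 0.0, 3.0],
    [0.0, 0.7, 0.0, 2.5],
    [0.0, 1.9, 0.0, 4.1],
])

selector = VarianceThreshold(threshold=0.0)   # drop zero-variance (constant) columns
X_train_reduced = selector.fit_transform(X_train)

print(selector.get_support())  # [False  True False  True]: columns 0 and 2 are dropped

# The same selection must be applied to the test data, even though those
# columns may contain non-zero values there:
# X_test_reduced = selector.transform(X_test)
```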

Generally, you should normalize or standardize your data before feeding it to PCA/SVD; otherwise these methods will pick up on the wrong patterns in the data (e.g. if the features are on very different scales from one another).
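
A quick illustration of that point (toy data, not your dataset): if one feature lives on a much larger scale than another, unscaled PCA attributes essentially all of the variance to it, whereas after standardization both features contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n = 500

# Two independent, equally "informative" features on very different scales.
X = np.column_stack([
    rng.normal(0.0, 1.0, n),     # measured in small units
    rng.normal(0.0, 1000.0, n),  # same kind of signal, measured in large units
])

pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Without scaling, the large-unit feature dominates the explained variance;
# with scaling, the split is roughly 50/50.
print("raw:   ", pca_raw.explained_variance_ratio_)
print("scaled:", pca_scaled.explained_variance_ratio_)
```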

Regarding the reason behind such a difference in accuracy, I'm not sure; I guess it has to do with some peculiarities of the dataset.
