Do I need to scale test data and Dependent variabl

Posted 2019-06-04 03:22

Question:

I am new to feature scaling in machine learning. I have read that scaling is useful when one feature's range is much larger than that of the other features. But if I choose to scale the training data:

  1. Can I scale only the one feature that has a large range?
  2. If I scale the entire X of the training data, do I also need to scale the y of the training data and the entire test data?

Answer 1:

  1. Yes, you can scale only the one feature that has a large range, but make sure that no other feature has a large range as well. If such a feature exists and is left unscaled, it will make the algorithm overlook the contributions of the scaled features, and even a slight change in it will affect the result (output value). It is recommended (but not compulsory) to scale all the features in the training set.
  2. You do not need to scale the Y of the training data, because the algorithm or model will set its parameter values to get the least cost (error), that is k{Y(output) - Y(original)}, anyway. But if Xtrain was scaled, then the feature values of the test set (Xtest) need to be scaled as well, using the training mean and variance, before being fed to the model. (Scale Ytest only if Ytrain was scaled.) The model was trained on data with the scaled range and has not seen the test data before, so if a test feature value diverges considerably from the corresponding feature range in the training data, the model will output a wrong prediction for that test example.
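
As a rough illustration of point 2, here is a minimal sketch of standardizing a single feature with statistics taken from the training data only. The helper names `fit_scaler` and `transform` and the sample values are made up for this example; they are not from any particular library.

```python
from statistics import mean, pstdev

def fit_scaler(train_column):
    """Compute mean and standard deviation from the TRAINING values only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid division by zero
    return mu, sigma

def transform(column, mu, sigma):
    """Standardize a column using previously fitted statistics."""
    return [(x - mu) / sigma for x in column]

# Hypothetical feature with a large range
x_train = [1000.0, 2000.0, 3000.0]
x_test = [2000.0, 4000.0]

mu, sigma = fit_scaler(x_train)                   # fitted on train only
x_train_scaled = transform(x_train, mu, sigma)
x_test_scaled = transform(x_test, mu, sigma)      # reuses the train statistics

print(x_train_scaled)
print(x_test_scaled)
```

Note that the test column is never passed to `fit_scaler`; a test value outside the training range (4000 here) simply lands outside the range of the scaled training values.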


Answer 2:

Yes, you can scale a single feature. You can interpret scaling as a means of giving the same importance to each feature. For instance, imagine you have data about people and you describe your examples via two features: height and weight. If you measure height in meters and weight in kilograms, a k-Nearest Neighbours classifier, when computing the distance between two examples, is likely to make its decisions based solely on the weight, because differences in weight (several kilograms) are numerically much larger than differences in height (fractions of a meter). In that case, you can scale one of the features to the same range as the other. Commonly, we scale all the features to the same range (e.g. 0 - 1). In addition, remember that the values you use to scale your training data (for example, the minimum and maximum of each feature) must also be used to scale the test data.
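
The height/weight example can be sketched in a few lines. The people, their values, and the helper names below are invented for illustration, and this is only the nearest-neighbour lookup, not a full k-NN classifier.

```python
import math

# Hypothetical training examples: name -> (height in m, weight in kg)
people = {"B": (1.95, 61.0), "C": (1.61, 64.0), "D": (1.80, 90.0)}
query = (1.60, 60.0)  # the example we want a neighbour for

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(point, candidates):
    """Return the name of the candidate closest to `point`."""
    return min(candidates, key=lambda name: euclidean(point, candidates[name]))

# Min and max of each feature, taken from the training examples only
lows = [min(v[i] for v in people.values()) for i in range(2)]
highs = [max(v[i] for v in people.values()) for i in range(2)]

def min_max(point):
    """Scale a point with the training min/max (train values map to 0 - 1)."""
    return tuple((x - lo) / (hi - lo) for x, lo, hi in zip(point, lows, highs))

nearest_raw = nearest(query, people)
nearest_scaled = nearest(min_max(query),
                         {name: min_max(v) for name, v in people.items()})

print(nearest_raw, nearest_scaled)
```

With the raw values the closest person is B, who is 35 cm taller than the query but only 1 kg heavier; after scaling, the closest is C, who is similar in both height and weight. The query is scaled with the same minimum and maximum as the training examples.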

As for the dependent variable y, you do not need to scale it.
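
One way to check this for a simple case: the sketch below assumes ordinary least squares with a single feature (other models may behave differently) and uses made-up data and a hypothetical helper `fit_line`. Fitting on the raw y, or fitting on a standardized y and converting the prediction back, gives the same prediction.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical training data
xs = [1.0, 2.0, 3.0, 4.0]
ys = [150.0, 210.0, 290.0, 330.0]
x_new = 5.0

# Fit on the raw y
slope, intercept = fit_line(xs, ys)
pred_raw = slope * x_new + intercept

# Fit on a standardized y, then undo the scaling on the prediction
y_mu, y_sigma = mean(ys), pstdev(ys)
ys_scaled = [(y - y_mu) / y_sigma for y in ys]
s_slope, s_intercept = fit_line(xs, ys_scaled)
pred_back = (s_slope * x_new + s_intercept) * y_sigma + y_mu

print(pred_raw, pred_back)
```

The two predictions agree up to floating-point rounding, so for this model scaling y changes the fitted parameters but not the result.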