Should I normalize training and test sets separately?

Posted 2019-07-13 11:43

I want to normalize my data to the range [0, 1]. Should I normalize the data after shuffling and splitting? Should I repeat the same procedure for the test set? I came across some Python code that performs this kind of normalization. Is this the correct way to normalize data to the target range [0, 1]?

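`import numpy as np

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
a = X_train
lis = []  # collects one rescaled column per iteration
for i in range(3):
    old_range = np.amax(a[:, i]) - np.amin(a[:, i])
    new_range = 1 - 0
    # shift the column so its minimum is 0, then scale it into [0, 1]
    f = ((a[:, i] - np.amin(a[:, i])) / old_range) * new_range + 0
    lis.append(f)
b = np.transpose(np.array(lis))
print(b)`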

Here is my result after normalization.

`[[0.5, 0., 1.]
[1., 0.5, 0.33333333]
[0., 1., 0.]]`

1 answer
男人必须洒脱
#2 · 2019-07-13 12:21

Should I normalize data after shuffling and splitting?

Yes. Otherwise, you are leaking information from the test set (which stands in for future, unseen data) into training. More information here; that discussion concerns standardization rather than normalization (and R rather than Python), but the arguments apply equally.
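As a minimal sketch of that ordering, assuming scikit-learn's `train_test_split` and `MinMaxScaler`: the random array `X`, the 25% test size, and the seed are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 20 samples, 3 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# 1) Shuffle and split first.
X_train, X_test = train_test_split(X, test_size=0.25, shuffle=True, random_state=0)

# 2) Fit the scaler on the training portion only.
scaler = MinMaxScaler().fit(X_train)

# For comparison: a scaler fitted on all rows has also seen the test rows,
# which is the leakage to avoid.
leaky_scaler = MinMaxScaler().fit(X)

print("train-only min/max:", scaler.data_min_, scaler.data_max_)
print("all-data min/max:  ", leaky_scaler.data_min_, leaky_scaler.data_max_)
```

Whenever a test row holds the overall minimum or maximum of a column, the two scalers differ, which means the test set has influenced the preprocessing.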

Should I repeat the same procedure for test?

Yes, using the scaler that was fitted on the training dataset. In this case, that means scaling the test dataset with the max and min taken from the training dataset. This keeps the transformation consistent with the one applied to the training data and makes it possible to evaluate whether the model generalizes well.
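Written out by hand, that rule might look like the sketch below. It uses the arrays from this thread; the variable names `train_min` and `train_range` are illustrative, and the code assumes no training column is constant (otherwise the division would be by zero).

```python
import numpy as np

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
X_test = np.array([[0., -1., 1.5], [2.5, 0., 1.]])

# Per-column statistics come from the training set only.
train_min = X_train.min(axis=0)
train_range = X_train.max(axis=0) - train_min

# Apply the same transformation to both sets.
X_train_scaled = (X_train - train_min) / train_range
X_test_scaled = (X_test - train_min) / train_range

print(X_train_scaled)
print(X_test_scaled)
```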

You do not have to code it from scratch. Using sklearn:

import numpy as np
from sklearn import preprocessing

X_train = np.array([[ 1., -1.,  2.], [ 2.,  0.,  0.],[ 0.,  1., -1.]])
X_test = np.array([[ 0, -1.,  1.5], [ 2.5,  0.,  1]])

# Learn the per-column min and max from the training data only
scaler = preprocessing.MinMaxScaler()
scaler = scaler.fit(X_train)

# Apply the same fitted transformation to both sets
X_train_minmax = scaler.transform(X_train)
X_test_minmax = scaler.transform(X_test)
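One consequence worth checking: because the test set is scaled with the training min and max, test values are not guaranteed to stay within [0, 1]. A small sketch, reusing the arrays above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
X_test = np.array([[0., -1., 1.5], [2.5, 0., 1.]])

scaler = MinMaxScaler().fit(X_train)
X_test_minmax = scaler.transform(X_test)

# 2.5 in the first column exceeds the training maximum of 2,
# so it maps to a value above 1.
print(X_test_minmax[1, 0])
print(X_test_minmax.min(), X_test_minmax.max())
```

This is expected behaviour and not an error. If a model needs inputs strictly inside [0, 1], `MinMaxScaler(clip=True)` (available in scikit-learn 0.24 and later) clips the transformed values to that range.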

Note: for most applications, standardization is the recommended approach for scaling; use preprocessing.StandardScaler().
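A sketch of the same fit-on-train, transform-both pattern with `StandardScaler`, again using the arrays from this thread:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
X_test = np.array([[0., -1., 1.5], [2.5, 0., 1.]])

# Learn the per-column mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0))  # approximately 0 in every column
print(X_train_std.std(axis=0))   # approximately 1 in every column
print(X_test_std)
```

The test set is shifted and scaled by the training mean and standard deviation, so its own columns will generally not have mean 0 and standard deviation 1.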
