I want to split my data into training and test sets. Should I apply normalization before or after the split? Does it make any difference when building a predictive model? Thanks in advance.
You can use fit and then transform (scikit-learn).
You first need to split the data into training and test sets (a validation set might also be required).
Don't forget that the test data points stand in for real-world data the model has never seen. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and scale the data by subtracting the mean and dividing by the standard deviation. If you compute the mean and standard deviation over the whole dataset, you leak information from the test set into the training explanatory variables, because those statistics then depend on the test points.
Therefore, you should compute the normalization statistics on the training data only. Then normalize the test instances as well, but using the mean and standard deviation of the training explanatory variables. This way, we can test and evaluate whether the model generalizes well to new, unseen data points.
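The split-then-scale order described above can be sketched roughly as follows. This is only an illustration, assuming scikit-learn's StandardScaler and train_test_split; the synthetic data, the 75/25 split and the random seed are made up for the example and are not from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (an assumption, not the asker's data).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1. Split first, so the test rows play no part in the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse those training statistics on the test rows (transform, not fit).
X_test_scaled = scaler.transform(X_test)
```

The scaled test features will usually not have a mean of exactly zero or a standard deviation of exactly one. That is expected, since they are scaled with the training statistics rather than their own.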