I am unable to understand the page on StandardScaler in the documentation of sklearn. Can anyone explain this to me in simple terms?
StandardScaler performs the task of standardization. A dataset usually contains variables that differ in scale. For example, an employee dataset might contain an AGE column with values on a scale of 20-70 and a SALARY column with values on a scale of 10000-80000. Because these two columns differ in scale, they are standardized to a common scale before building a machine learning model.
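A minimal sketch of this idea (the AGE and SALARY values below are made up purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical employee data: AGE (20-70) and SALARY (10000-80000)
X = np.array([[25, 15000],
              [35, 40000],
              [50, 60000],
              [65, 80000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)               # both columns now vary on a comparable, unit-free scale
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```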
This is useful when you want to compare data that correspond to different units. In that case, you want to remove the units. To do that in a consistent way for all the data, you transform the data so that the variance is unitary and the mean of the series is 0.
The following is a simple working example to explain how the standardization calculation works. The theory part is already well explained in other answers.
Calculation
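The snippet below uses a small assumed dataset (4 samples, 2 features), chosen so that its column means and standard deviations match the numbers quoted next:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed example data: 4 samples, 2 features
data = np.array([[6, 2],
                 [4, 2],
                 [6, 4],
                 [8, 2]], dtype=float)

scaler = StandardScaler()
scaler.fit(data)

print(scaler.mean_)   # column means               -> [6.  2.5]
print(scaler.scale_)  # column standard deviations -> [1.41421356 0.8660254 ]
```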
As you can see in the output, the mean is [6., 2.5] and the standard deviation is [1.41421356, 0.8660254].

The data in position (0, 1) is 2, so its standardized value is (2 - 2.5)/0.8660254 = -0.57735027.

The data in position (1, 0) is 4, so its standardized value is (4 - 6)/1.41421356 = -1.414.
Result After Standardization
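Using the same assumed data as above (setup repeated so the snippet runs on its own):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[6, 2], [4, 2], [6, 4], [8, 2]], dtype=float)

scaled = StandardScaler().fit_transform(data)
print(scaled)
# Approximately:
# [[ 0.         -0.57735027]
#  [-1.41421356 -0.57735027]
#  [ 0.          1.73205081]
#  [ 1.41421356 -0.57735027]]
```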
Check Mean and Std Deviation After Standardization
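Again with the same assumed data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[6, 2], [4, 2], [6, 4], [8, 2]], dtype=float)
scaled = StandardScaler().fit_transform(data)

print(scaled.mean(axis=0))  # approximately [0. 0.]; tiny residuals such as -2.77555756e-17
                            # are floating-point noise
print(scaled.std(axis=0))   # [1. 1.]
```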
Note: -2.77555756e-17 is very close to 0.
References
Compare the effect of different scalers on data with outliers
What's the difference between Normalization and Standardization?
Mean of data scaled with sklearn StandardScaler is not zero
After applying `StandardScaler()`, each column in X will have a mean of 0 and a standard deviation of 1. Formulas are listed by others on this page.

Rationale: some algorithms require data to look like this (see the sklearn docs).
The main idea is to normalize/standardize (`mean = 0` and `standard deviation = 1`) your features/variables/columns of `X` before applying machine learning techniques.

One important thing that you should keep in mind is that most (if not all) `scikit-learn` models/classes/functions expect as input a matrix `X` with dimensions/shape `[number_of_samples, number_of_features]`. This is very important. Some other libraries expect the input the other way around, i.e. `[number_of_features, number_of_samples]`.

IMPORTANT: `StandardScaler()` will normalize the features (each column of `X`, individually!) so that each column/feature/variable will have `mean = 0` and `standard deviation = 1`.

P.S.: I find the most upvoted answer on this page wrong. I am quoting "each value in the dataset will have the sample mean value subtracted" -- this is neither true nor correct.
Example:
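A minimal sketch (the data below is assumed purely for illustration):

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)

print(scaled_X)
# [[-1. -1.]
#  [ 1. -1.]
#  [-1.  1.]
#  [ 1.  1.]]
```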
Verify that the mean of each feature (column) is 0:
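Continuing the sketch above (setup repeated so it runs on its own):

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
scaled_X = StandardScaler().fit_transform(X)

print(scaled_X.mean(axis=0))  # -> [0. 0.]
```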
Verify that the std of each feature (column) is 1:
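And similarly for the standard deviation:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
scaled_X = StandardScaler().fit_transform(X)

print(scaled_X.std(axis=0))  # -> [1. 1.]
```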
The maths:
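The formula that StandardScaler applies to each column x, as given in the sklearn documentation, is:

```
z = (x - u) / s
```

where `u` is the mean of the column and `s` is its standard deviation, both computed on the data passed to `fit`.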
UPDATE 08/2019: Concerning the input parameters `with_mean` and `with_std` set to `False`/`True`, I have provided an answer here: https://stackoverflow.com/a/57381708/5025009

The answers above are great, but I needed a simple example to alleviate some concerns that I have had in the past. I wanted to make sure it was indeed treating each column separately. I am now reassured and can't find what example had caused me concern. All columns ARE scaled separately, as described by those above.
CODE
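A minimal sketch of such a check (the data below is made up; the second column is an exact multiple of the first, which makes it easy to see that every column is scaled against its own mean and standard deviation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (second column = 10 x first column)
X = np.array([[1, 10],
              [2, 20],
              [3, 30],
              [4, 40],
              [5, 50]], dtype=float)

scaled_X = StandardScaler().fit_transform(X)
print(scaled_X)
```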
OUTPUT
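The output of the sketch above should look approximately like this; both columns end up with identical standardized values because each was centered and scaled using its own mean and standard deviation:

```
[[-1.41421356 -1.41421356]
 [-0.70710678 -0.70710678]
 [ 0.          0.        ]
 [ 0.70710678  0.70710678]
 [ 1.41421356  1.41421356]]
```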