I am new to the scikit-learn library and have been trying to use it to predict stock prices. I was going through its documentation and got stuck at the part where they explain OneHotEncoder(). Here is the code that they have used:
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
Can someone please explain step by step what is happening here? I have a clear idea of how one-hot encoding works, but I'm not able to figure out how this code works. Any help is appreciated. Thanks!
Let's take these features one at a time:
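We're fitting an encoder to a set of four vectors, with 3 features each.
enc.n_values_ then reports how many values each feature takes: the first feature has 2 values (0, 1), the second has 3 (0, 1, 2), and the third has 4 (0, 1, 2, 3).
Clear?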
The representation will concatenate the vectors for the three features. Since there are three features, the representation will always have three "True" entries (1), the rest "False" (0).
Since there are 2+3+4 possible values, the representation is 9 entries long.
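enc.feature_indices_ gives the layout: the first feature starts at index 0, the second at index 2, the third at index 5, with the end barricade at index 9.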
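The start indices above are just a running total of the per-feature value counts. A minimal sketch in plain Python, assuming the counts from the question; the variable names are illustrative and not scikit-learn attributes:

```python
from itertools import accumulate

# Number of distinct values per feature, as reported by enc.n_values_.
n_values = [2, 3, 4]

# Each feature starts where the previous one ended; the last entry is the
# total length of the encoded vector (the "end barricade").
feature_indices = [0] + list(accumulate(n_values))

print(feature_indices)  # [0, 2, 5, 9]
```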
Encoding the given values simply concatenates the three one-hot vectors for the values 0, 1, 1: [1, 0], [0, 1, 0], and [0, 1, 0, 0].
Slap those end-to-end, convert to the given float format, and we have the array shown in the example.
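The concatenation can be reproduced by hand. This is only a sketch of the idea, not how scikit-learn implements it; one_hot is a hypothetical helper written for this illustration:

```python
# Number of possible values per feature (enc.n_values_ in the question).
n_values = [2, 3, 4]
sample = [0, 1, 1]


def one_hot(value, size):
    """Return a list of `size` floats with a 1.0 at position `value`."""
    return [1.0 if i == value else 0.0 for i in range(size)]


# Build one vector per feature, then join them end to end.
pieces = [one_hot(v, n) for v, n in zip(sample, n_values)]
encoded = [x for piece in pieces for x in piece]

print(pieces)
print(encoded)
```

The final list has the same nine entries as the enc.transform output in the question.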
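Let's start off by writing down what you would expect (assuming you know what one-hot encoding means):
unencoded: the four rows passed to fit, with one column per feature (f0, f1, f2)
encoded: the same four rows, one-hot encoded into 2 + 3 + 4 = 9 columns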
To get encoded, fit the encoder on unencoded and transform it; that is all you need if you use the default n_values='auto'. With the default 'auto' you're specifying that the values your features (columns of unencoded) could possibly take on can be inferred from the values in the columns of the data handed to fit.
That brings us to enc.n_values_, which is array([2, 3, 4]) here.
The above means that f0 (column 1) can take on 2 values (0, 1), f1 can take on 3 values, (0, 1, 2) and f2 can take on 4 values (0, 1, 2, 3).
Indeed, these are the values taken by the features f0, f1, f2 in the unencoded feature matrix.
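A rough way to see where those counts come from, under the assumption that 'auto' treats every column as taking integer values from 0 up to its maximum in the fitted data (this is a hand-rolled sketch, not the library's code):

```python
# The four rows passed to enc.fit in the question.
unencoded = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

# Assumption: each column takes integer values 0..max, so the number of
# possible values is the column maximum plus one.
n_values = [max(column) + 1 for column in zip(*unencoded)]

print(n_values)  # [2, 3, 4]
```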
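Then there is enc.feature_indices_, which is array([0, 2, 5, 9]) here.
It gives the range of positions (in the encoded space) that the features f0, f1, f2 can take on: f0 occupies positions 0 to 1, f1 positions 2 to 4, and f2 positions 5 to 8.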
Mapping the vector [0, 1, 1] into one-hot encoded space (under the mapping we got from enc.fit) gives [1, 0, 0, 1, 0, 0, 1, 0, 0], which is the enc.transform output in the question.
How?
The first element, 0, belongs to f0, so it maps to position 0 (if the element were 1 instead of 0, we would map it to position 1).
The next element, 1, maps to position 3 because f1 starts at position 2 and 1 is the second possible value f1 can take on.
Finally, the third element, 1, maps to position 6 since it is the second possible value f2 can take on and f2 starts at position 5.
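The three steps above boil down to "start offset of the feature plus the value". A small sketch of that arithmetic in plain Python, with the offsets copied from the question rather than computed by scikit-learn:

```python
feature_indices = [0, 2, 5, 9]  # enc.feature_indices_ from the question
sample = [0, 1, 1]

# Each value lands at its feature's start offset plus the value itself.
# zip stops after three pairs, so the end marker 9 is not used here.
positions = [start + value for start, value in zip(feature_indices, sample)]

# Put a 1.0 at each of those positions in a vector of length 9.
encoded = [0.0] * feature_indices[-1]
for p in positions:
    encoded[p] = 1.0

print(positions)  # [0, 3, 6]
print(encoded)
```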
Hope that clears up some stuff.