I need to convert an independent (feature) column from strings to a numeric representation. I am using OneHotEncoder for the transformation. My dataset has many independent columns, some of which are:
Country | Age
--------------------------
Germany | 23
Spain | 25
Germany | 24
Italy | 30
I need to encode the Country column like this:
0 | 1 | 2 | 3
--------------------------------------
1 | 0 | 0 | 23
0 | 1 | 0 | 25
1 | 0 | 0 | 24
0 | 0 | 1 | 30
I managed to get the desired transformation using LabelEncoder and OneHotEncoder as follows:
#Encoding the categorical data
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
#we are dummy encoding as the machine learning algorithms will be
#confused with the values like Spain > Germany > France
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
Now I'm getting a deprecation message telling me to use categories='auto'. If I do so, the transformation is applied to all independent columns (country, age, salary, etc.).
How can I apply the transformation to the 0th column of the dataset only?
There are actually two warnings:
and the second:
In the future, you should not define the columns in the OneHotEncoder directly; the encoder should simply be used with categories='auto'. The first message also tells you to use OneHotEncoder directly, without applying LabelEncoder first. Finally, the second message tells you to use ColumnTransformer, which works like a Pipeline for column transformations.
Here is the equivalent code for your case:
See also: ColumnTransformer documentation
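A minimal sketch of that approach, assuming a scikit-learn version that provides ColumnTransformer (0.20 or later). The toy array X (built from the four rows in the question), the transformer name "country", and the choice of sparse_threshold=0 are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data taken from the question: column 0 is Country, column 1 is Age.
X = np.array(
    [["Germany", 23], ["Spain", 25], ["Germany", 24], ["Italy", 30]],
    dtype=object,
)

# One-hot encode column 0 only; every other column is passed through unchanged.
ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# The encoded columns come first, followed by the passed-through columns.
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)
print(X_encoded)
```

Note that OneHotEncoder orders the new columns by sorted category (Germany, Italy, Spain), so the column order can differ from the table in the question.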
For the above example:
Don't use LabelEncoder; use OneHotEncoder directly.
The remainder setting keeps the other columns as they are, while column [0] is replaced by its encoded columns.
As of version 0.22, you can write the same code as below:
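As a hedged illustration of the point being made, assuming scikit-learn 0.22 or later: OneHotEncoder can be fit on the string column directly, with no LabelEncoder step. The toy array X and the manual np.hstack recombination are assumptions made for this sketch, not the original answer's code:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data taken from the question: column 0 is Country, column 1 is Age.
X = np.array(
    [["Germany", 23], ["Spain", 25], ["Germany", 24], ["Italy", 30]],
    dtype=object,
)

# Fit the encoder on the raw strings of column 0 (kept 2-D via X[:, [0]]).
encoder = OneHotEncoder()
country_encoded = encoder.fit_transform(X[:, [0]]).toarray()

# The learned categories are stored in sorted order.
categories = list(encoder.categories_[0])

# Put the encoded columns in front of the remaining, untouched columns.
X_encoded = np.hstack([country_encoded, X[:, 1:]])
print(categories)
print(X_encoded)
```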
As you can see, you don't need to use LabelEncoder anymore.
There is also a way to do one-hot encoding with pandas. Python:
Give names to the newly formed columns and add them to your dataframe. Check the pandas documentation here.
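The pandas route can be sketched as follows, using pd.get_dummies to create and name the new columns and pd.concat to add them to the dataframe. The dataframe df, the "Country" prefix, and dtype=int are illustrative assumptions:

```python
import pandas as pd

# Toy data taken from the question.
df = pd.DataFrame(
    {"Country": ["Germany", "Spain", "Germany", "Italy"], "Age": [23, 25, 24, 30]}
)

# One indicator column per country, named Country_<value>.
dummies = pd.get_dummies(df["Country"], prefix="Country", dtype=int)

# Replace the original Country column with the indicator columns.
df_encoded = pd.concat([dummies, df.drop(columns="Country")], axis=1)
print(df_encoded)
```

Unlike a fitted OneHotEncoder, get_dummies does not remember the categories it saw, so new data containing different countries would produce different columns.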
Use the following code: