All four functions seem really similar to me. In some situations some of them might give the same result and some might not. Any help will be greatly appreciated!
What I know so far, and what I assume: internally, `factorize` and `LabelEncoder` work the same way, with no big differences in terms of results. I am not sure whether they take a similar amount of time on large datasets.
`get_dummies` and `OneHotEncoder` will yield the same result, but `OneHotEncoder` can only handle numbers, whereas `get_dummies` will take all kinds of input. `get_dummies` will generate new column names automatically for each input column, but `OneHotEncoder` will not (it will instead assign the new columns numeric names 1, 2, 3, ...). So `get_dummies` seems better in all respects.
Please correct me if I am wrong! Thank you!
These four encoders can be split into two categories:

- Encode labels into categorical variables: pandas `factorize` and scikit-learn `LabelEncoder`. The result will have 1 dimension.
- Encode categorical variables into dummy/indicator (binary) variables: pandas `get_dummies` and scikit-learn `OneHotEncoder`. The result will have n dimensions, one per distinct value of the encoded categorical variable.

The main difference between the pandas and scikit-learn encoders is that the scikit-learn encoders are made to be used in scikit-learn pipelines, with `fit` and `transform` methods.
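To make the `fit`/`transform` point concrete, here is a minimal sketch (not from the original answer; the values are illustrative): the encoder learns its mapping on the training data, then reuses that exact mapping on new data.

```python
from sklearn.preprocessing import LabelEncoder

train = ["b", "a", "c", "a"]
test = ["c", "b"]

le = LabelEncoder()
le.fit(train)               # learn the mapping: a -> 0, b -> 1, c -> 2
print(le.transform(train))  # [1 0 2 0]
print(le.transform(test))   # [2 1] -- the same mapping, reused on new data
```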
**Encode labels into categorical variables**
Pandas `factorize` and scikit-learn `LabelEncoder` belong to the first category. They can be used to create categorical variables, for example to transform characters into numbers.
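A minimal sketch of the two 1-D encoders (the example values are my own, not from the original answer). Note that they number the categories differently: `factorize` assigns codes in order of first appearance, while `LabelEncoder` assigns codes in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["paris", "tokyo", "paris", "amsterdam"])

# pandas: codes follow the order of first appearance
codes, uniques = pd.factorize(s)
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['paris', 'tokyo', 'amsterdam'], dtype='object')

# scikit-learn: codes follow the sorted order of the classes
le = LabelEncoder()
print(le.fit_transform(s))  # [1 2 1 0]
print(le.classes_)          # ['amsterdam' 'paris' 'tokyo']
```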
**Encode categorical variables into dummy/indicator (binary) variables**

Pandas `get_dummies` and scikit-learn `OneHotEncoder` belong to the second category. They can be used to create binary variables. Historically, `OneHotEncoder` could only be used with categorical integers, while `get_dummies` can be used with other types of variables; recent scikit-learn versions (0.20+) accept strings directly as well.
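A minimal sketch of the two n-dimensional encoders (again with illustrative values; it assumes scikit-learn 1.0+ for `get_feature_names_out`). `get_dummies` names the new columns automatically, while `OneHotEncoder` returns a bare matrix and the names must be asked for separately.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["paris", "tokyo", "amsterdam"]})

# pandas: one binary column per distinct value, named automatically
# (city_amsterdam, city_paris, city_tokyo)
print(pd.get_dummies(df, columns=["city"]))

# scikit-learn: returns a sparse matrix with no column names
enc = OneHotEncoder()
onehot = enc.fit_transform(df[["city"]]).toarray()
print(onehot)                       # 3x3 binary matrix
print(enc.get_feature_names_out())  # ['city_amsterdam' 'city_paris' 'city_tokyo']
```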
I've also written a more detailed post based on this answer.