Want to know the diff among pd.factorize, pd.get_d

2019-02-03 05:18发布

All four functions seem really similar to me. In some situations some of them might give the same result, some not. Any help will be thankfully appreciated!

Now I know and I assume that internally, factorize and LabelEncoder work the same way and having no big differences in terms of results. I am not sure whether they will take up similar time with large magnitudes of data.

get_dummies and OneHotEncoder will yield the same result but OneHotEncoder can only handle numbers but get_dummies will take all kinds of input. get_dummies will generate new column names automatically for each column input, but OneHotEncoder will not (it rather will assign new column names 1,2,3....). So get_dummies is better in all respectives.

Please correct me if I am wrong! Thank you!

1条回答
我只想做你的唯一
2楼-- · 2019-02-03 05:53

These four encoders can be split in two categories:

  • Encode labels into categorical variables: Pandas factorize and scikit-learn LabelEncoder. The result will have 1 dimension.
  • Encode categorical variable into dummy/indicator (binary) variables: Pandas get_dummies and scikit-learn OneHotEncoder. The result will have n dimensions, one by distinct value of the encoded categorical variable.

The main difference between pandas and scikit-learn encoders is that scikit-learn encoders are made to be used in scikit-learn pipelines with fit and transform methods.

Encode labels into categorical variables

Pandas factorize and scikit-learn LabelEncoder belong to the first category. They can be used to create categorical variables for example to transform characters into numbers.

from sklearn import preprocessing    
# Test data
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])    
df['Fact'] = pd.factorize(df['Col'])[0]
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])

print(df)
#   Col  Fact  Lab
# 0   A     0    0
# 1   B     1    1
# 2   B     1    1
# 3   C     2    2

Encode categorical variable into dummy/indicator (binary) variables

Pandas get_dummies and scikit-learn OneHotEncoder belong to the second category. They can be used to create binary variables. OneHotEncoder can only be used with categorical integers while get_dummies can be used with other type of variables.

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df = pd.get_dummies(df)

print(df)
#    Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# We need to transform first character into integer in order to use the OneHotEncoder
le = preprocessing.LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])
enc = OneHotEncoder()
df = DataFrame(enc.fit_transform(df).toarray())

print(df)
#      0    1    2
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  1.0  0.0
# 3  0.0  0.0  1.0

I've also written a more detailed post based on this answer.

查看更多
登录 后发表回答