Suppose I have a dataframe with countries that goes as:
cc | temp
US | 37.0
CA | 12.0
US | 35.0
AU | 20.0
I know that there is a pd.get_dummies function to convert the countries to 'one-hot encodings'. However, I wish to convert them to indices instead, such that I will get cc_index = [1,2,1,3].
I'm assuming that there is a faster way than using get_dummies along with a numpy where clause, as shown below:
[np.where(x) for x in pd.get_dummies(df.cc).values]
This is somewhat easier to do in R using 'factors' so I'm hoping pandas has something similar.
If you are using the sklearn library, you can use LabelEncoder. Like pd.Categorical, input strings are sorted alphabetically before encoding.
First, change the type of the column:
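A minimal sketch of this step, assuming the question's data lives in a DataFrame named df (the name used in the question's own snippet). It uses astype("category"); assigning pd.Categorical(df["cc"]) to the column should give the same result.

```python
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Convert the string column to a categorical dtype.
df["cc"] = df["cc"].astype("category")

# The categories are stored in sorted order.
print(df["cc"].cat.categories.tolist())  # ['AU', 'CA', 'US']
```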
Now the data look similar but are stored categorically. To capture the category codes:
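As a sketch, the codes can be read from the .cat accessor and stored in a new column. The column name cc_index is borrowed from the question, and the data is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})
df["cc"] = df["cc"].astype("category")

# Each row gets the integer position of its category (AU=0, CA=1, US=2).
df["cc_index"] = df["cc"].cat.codes

print(df["cc_index"].tolist())  # [2, 1, 2, 0]
```

The codes start at 0 and follow the sorted category order, not the order of first appearance.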
Now you have:
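Assuming the two steps above were applied to the question's data (with cc_index as the hypothetical name of the new column), printing the frame should look roughly like this:

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})
df["cc"] = df["cc"].astype("category")
df["cc_index"] = df["cc"].cat.codes

print(df)
#    cc  temp  cc_index
# 0  US  37.0         2
# 1  CA  12.0         1
# 2  US  35.0         2
# 3  AU  20.0         0
```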
If you don't want to modify your DataFrame but simply get the codes:
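A sketch of the non-mutating variant: the conversion happens on a temporary Series, so df keeps its original columns. The variable name codes is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Convert on the fly and keep only the integer codes; df is left untouched.
codes = df["cc"].astype("category").cat.codes

print(codes.tolist())    # [2, 1, 2, 0]
print(list(df.columns))  # ['cc', 'temp']
```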
Or use the categorical column as an index:
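One way this might look, as a sketch: build a second frame (called df2 here purely for illustration) that holds the temperatures and is indexed by a pd.CategoricalIndex made from the country column.

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Keep only the temperatures, then replace the index with the categorical countries.
df2 = pd.DataFrame(df["temp"])
df2.index = pd.CategoricalIndex(df["cc"])

# The integer codes are available on the index itself.
print(df2.index.codes.tolist())         # [2, 1, 2, 0]

# Label-based selection still works with the country strings.
print(df2.loc["US", "temp"].tolist())   # [37.0, 35.0]
```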
If you wish only to transform your series into integer identifiers, you can use pd.factorize.
Note that this solution, unlike pd.Categorical, will not sort alphabetically, so the first country will be assigned 0. If you wish to start from 1, you can add a constant to the resulting codes. If you wish to sort alphabetically, specify sort=True.
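The pd.factorize variants described above can be sketched as follows, using the question's data. The names codes, uniques, sorted_codes, and the cc_index column are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# factorize returns the integer codes and the unique values in order of appearance.
codes, uniques = pd.factorize(df["cc"])
print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['US', 'CA', 'AU']

# Add a constant so that numbering starts at 1.
df["cc_index"] = codes + 1
print(df["cc_index"].tolist())  # [1, 2, 1, 3]

# With sort=True the codes follow alphabetical order instead.
sorted_codes, sorted_uniques = pd.factorize(df["cc"], sort=True)
print(sorted_codes.tolist())    # [2, 1, 2, 0]
```

The unsorted version shifted by 1 reproduces the cc_index = [1,2,1,3] asked for in the question.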