Pandas: convert categories to numbers

Suppose I have a DataFrame of countries that looks like this:

cc | temp
US | 37.0
CA | 12.0
US | 35.0
AU | 20.0

I know that there is a pd.get_dummies function to convert the countries to 'one-hot encodings'. However, I wish to convert them to indices instead, so that I get cc_index = [1,2,1,3].

I'm assuming that there is a faster way than using the get_dummies along with a numpy where clause as shown below:

[np.where(x) for x in pd.get_dummies(df.cc).values]

This is somewhat easier to do in R using 'factors' so I'm hoping pandas has something similar.

Tags: python pandas series categorical-data

3 Answers

人气声优

Reply #2 · 2019-01-02 21:19

If you are using the sklearn library you can use LabelEncoder. Like pd.Categorical, input strings are sorted alphabetically before encoding.

from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
df['code'] = LE.fit_transform(df['cc'])

print(df)

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
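
To see which number stands for which country, the fitted encoder can be inspected. The sketch below rebuilds the example frame from the question and assumes scikit-learn is installed; the names mapping and restored are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example frame from the question.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

le = LabelEncoder()
codes = le.fit_transform(df["cc"])

# classes_ holds the sorted labels; a label's position is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}

# inverse_transform turns the codes back into the original labels.
restored = le.inverse_transform(codes)

print(mapping)
print(list(restored))
```

Keeping the fitted LE object around is what lets the same mapping be applied to new data with LE.transform later.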


怪性笑人.

Reply #3 · 2019-01-02 21:32

First, change the type of the column:

df.cc = pd.Categorical(df.cc)

Now the data look similar but are stored categorically. To capture the category codes:

df['code'] = df.cc.cat.codes

Now you have:

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0

If you don't want to modify your DataFrame but simply get the codes:

df.cc.astype('category').cat.codes

Or use the categorical column as an index:

df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
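
If you also need to know which code belongs to which country, the categories can be read back from the .cat accessor. This is a small sketch on the example frame from the question; code_to_label and with_missing are just illustrative names.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

cat = df["cc"].astype("category")

# cat.categories lists the labels in code order (sorted by default),
# so enumerating it gives the code -> label mapping.
code_to_label = dict(enumerate(cat.cat.categories))

# Missing values do not get a category of their own; their code is -1.
with_missing = pd.Series(["US", None, "AU"]).astype("category")
missing_codes = with_missing.cat.codes.tolist()

print(code_to_label)
print(missing_codes)
```

The -1 code is worth keeping in mind if you add 1 to get indices starting at 1, since missing values then end up as 0.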


梦寄多情

Reply #4 · 2019-01-02 21:37

If you wish only to transform your series into integer identifiers, you can use pd.factorize.

Note that this solution, unlike pd.Categorical, does not sort alphabetically by default: codes are assigned in order of first appearance, so the first country gets 0. If you wish to start from 1, you can add a constant:

df['code'] = pd.factorize(df['cc'])[0] + 1

print(df)

   cc  temp  code
0  US  37.0     1
1  CA  12.0     2
2  US  35.0     1
3  AU  20.0     3

If you wish to sort alphabetically, specify sort=True:

df['code'] = pd.factorize(df['cc'], sort=True)[0] + 1
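
pd.factorize returns two things, the codes and the unique values, and the examples above only use the first. A short sketch on the example frame from the question shows how the second can be used to go back from codes to countries; the variable names are illustrative.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# codes is an integer array; uniques holds each value once,
# in order of first appearance.
codes, uniques = pd.factorize(df["cc"])

# Indices starting at 1, as asked for in the question.
one_based = (codes + 1).tolist()

# Look the codes up in uniques to recover the original labels.
restored = uniques.take(codes)

print(one_based)
print(list(uniques))
print(list(restored))
```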
