How to consistently hot encode dataframes with cha

I'm getting a stream of content in the form of dataframes, each batch with different values in columns. For example one batch might look like:

day1_data = {'state': ['MS', 'OK', 'VA', 'NJ', 'NM'], 
            'city': ['C', 'B', 'G', 'Z', 'F'], 
            'age': [27, 19, 63, 40, 93]}

and another one like:

day2_data = {'state': ['AL', 'WY', 'VA'], 
            'city': ['A', 'B', 'E'], 
            'age': [42, 52, 73]}

how can the columns be hot encoded in a way that returns a consistent number of columns?

If I use pandas's get_dummies() on each of the batches, it returns a different number of columns:

df1 = pd.get_dummies(pd.DataFrame(day1_data))
df2 = pd.get_dummies(pd.DataFrame(day2_data))

len(df1.columns) == len(df2.columns)

I can get all the possible values for each column, the question is even with that information what's the simplest way to generate the one hot encoding for each daily batch so the number of columns will be consistent?

标签： python pandas scikit-learn one-hot-encoding

1条回答

爷、活的狠高调

2楼-- · 2019-05-28 11:14

Okay since all the possible values are known in advance. Then below is a slightly hackish way of doing it.

import numpy as np
import pandas as pd

# This is a one time process
# Keep all the possible data here in lists
# Can add other categorical variables too which have this type of data
all_possible_states=  ['AL', 'MS', 'MS', 'OK', 'VA', 'NJ', 'NM', 'CD', 'WY']
all_possible_cities= ['A', 'B', 'C', 'D', 'E', 'G', 'Z', 'F']

# Declare our transformer class
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class MyOneHotEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, all_possible_values):
        self.le = LabelEncoder()
        self.ohe = OneHotEncoder()
        self.ohe.fit(self.le.fit_transform(all_possible_values).reshape(-1,1))

    def transform(self, X, y=None):
        return self.ohe.transform(self.le.transform(X).reshape(-1,1)).toarray()

# Allow the transformer to see all the data here
encoders = {}
encoders['state'] = MyOneHotEncoder(all_possible_states)
encoders['city'] = MyOneHotEncoder(all_possible_cities)
# Do this for all categorical columns

# Now this is our method which will be used on the incoming data 
def encode(df):

    tup = (encoders['state'].transform(df['state']), 
           encoders['city'].transform(df['city']),
           # Add all other columns which are not to be transformed
           df[['age']])

    return np.hstack(tup)

# Testing:
day1_data = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'], 
        'city': ['C', 'B', 'G', 'Z', 'F'], 
        'age': [27, 19, 63, 40, 93]})

print(encode(day1_data))
[[  0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.
    0.   0.  27.]
 [  0.   0.   0.   0.   0.   1.   0.   0.   0.   1.   0.   0.   0.   0.
    0.   0.  19.]
 [  0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.
    1.   0.  63.]
 [  0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   1.  40.]
 [  0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   1.
    0.   0.  93.]]


day2_data = pd.DataFrame({'state': ['AL', 'WY', 'VA'], 
            'city': ['A', 'B', 'E'], 
            'age': [42, 52, 73]})

print(encode(day2_data))
[[  1.   0.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.
    0.   0.  42.]
 [  0.   0.   0.   0.   0.   0.   0.   1.   0.   1.   0.   0.   0.   0.
    0.   0.  52.]
 [  0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   1.   0.
    0.   0.  73.]]

Do go through the comments and if still any issue, ask me.

0人赞添加讨论(0) 举报

How to consistently hot encode dataframes with cha

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间