I'm receiving a stream of data as DataFrames, and each batch has different values in its columns. For example, one batch might look like:
day1_data = {'state': ['MS', 'OK', 'VA', 'NJ', 'NM'],
'city': ['C', 'B', 'G', 'Z', 'F'],
'age': [27, 19, 63, 40, 93]}
and another one like:
day2_data = {'state': ['AL', 'WY', 'VA'],
'city': ['A', 'B', 'E'],
'age': [42, 52, 73]}
How can the columns be one-hot encoded in a way that returns a consistent number of columns?
If I use pandas's get_dummies() on each of the batches, it returns a different number of columns:
import pandas as pd
df1 = pd.get_dummies(pd.DataFrame(day1_data))
df2 = pd.get_dummies(pd.DataFrame(day2_data))
len(df1.columns) == len(df2.columns)  # False: 11 columns vs. 7
I can get all the possible values for each column. The question is: even with that information, what's the simplest way to generate the one-hot encoding for each daily batch so that the number of columns is consistent?
Since all the possible values are known in advance, below is a slightly hackish way of doing it.
Go through the comments, and if anything is still unclear, ask me.
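One way to do this, as a minimal sketch rather than the only approach: convert each text column to a categorical dtype that carries the full list of known values, then call get_dummies(). pandas emits one dummy column per declared category, even for categories that don't appear in the batch. The `known_values` dict and the `encode_batch` helper are hypothetical names made up for this illustration, and the value lists are just the union of the two sample batches.

```python
import pandas as pd

# Assumed for illustration: the full set of possible values per column.
# In practice this comes from wherever the complete vocabulary is known.
known_values = {
    'state': ['AL', 'MS', 'NJ', 'NM', 'OK', 'VA', 'WY'],
    'city': ['A', 'B', 'C', 'E', 'F', 'G', 'Z'],
}

def encode_batch(batch, known_values):
    """One-hot encode a batch so every batch yields the same columns."""
    df = pd.DataFrame(batch)
    for col, values in known_values.items():
        # Declaring the categories up front makes get_dummies create a
        # column for every known value, not only the ones seen in this batch.
        df[col] = pd.Categorical(df[col], categories=values)
    return pd.get_dummies(df)

day1_data = {'state': ['MS', 'OK', 'VA', 'NJ', 'NM'],
             'city': ['C', 'B', 'G', 'Z', 'F'],
             'age': [27, 19, 63, 40, 93]}
day2_data = {'state': ['AL', 'WY', 'VA'],
             'city': ['A', 'B', 'E'],
             'age': [42, 52, 73]}

df1 = encode_batch(day1_data, known_values)
df2 = encode_batch(day2_data, known_values)

# Both batches now share the same columns in the same order.
print(list(df1.columns) == list(df2.columns))
```

Note that a value missing from the declared categories becomes NaN in the categorical column, so its row gets all zeros for that feature instead of a new column. If that should be an error instead, check for NaN after the conversion.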