I'm trying to perform a one hot encoding of a trivial dataset.
data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]
What's the best way to preprocess this data using Scikit-Learn?
On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But the one-hot encoder doesn't support strings as features; it only encodes integer-valued categorical features.
So then you would use a LabelEncoder, which encodes the strings as integers. But then you have to apply a label encoder to each of the columns and store every one of these encoders (as well as the column each was applied to). This feels extremely clunky.
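For reference, the per-column workaround described above might look like the following sketch. The `encoders` dictionary and the variable names are illustrative only, not part of any library API.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = np.array([['a', 'dog', 'red'],
                 ['b', 'cat', 'green']])

# One LabelEncoder per column, kept in a dict keyed by column index
# so the same string-to-integer mapping can be reused on new data later.
encoders = {}
encoded = np.empty(data.shape, dtype=int)
for col in range(data.shape[1]):
    encoders[col] = LabelEncoder()
    encoded[:, col] = encoders[col].fit_transform(data[:, col])

print(encoded)
```

The integer codes would still have to go through a one-hot step afterwards, which is the bookkeeping the question is complaining about.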
So, what's the best way to do it in Scikit-Learn?
Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one-hot encoding. However, it's limited in that you can't encode your training and test sets separately and still get matching columns.
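To make the train/test concern concrete, here is a minimal pure-Python sketch of the fit-then-transform behaviour being asked for. The helper names `fit_categories` and `one_hot` are hypothetical and do not come from Scikit-Learn or pandas; the point is only that the categories are learned once from the training set and then reused on the test set.

```python
def fit_categories(rows):
    """Learn the sorted set of categories seen in each column."""
    n_cols = len(rows[0])
    return [sorted({row[i] for row in rows}) for i in range(n_cols)]


def one_hot(rows, categories):
    """Encode rows using previously learned categories.

    Values not seen during fitting become an all-zero block.
    """
    encoded = []
    for row in rows:
        vec = []
        for value, cats in zip(row, categories):
            vec.extend(1 if value == c else 0 for c in cats)
        encoded.append(vec)
    return encoded


train = [['a', 'dog', 'red'], ['b', 'cat', 'green']]
test = [['b', 'dog', 'blue']]  # 'blue' never appears in train

categories = fit_categories(train)
train_encoded = one_hot(train, categories)
test_encoded = one_hot(test, categories)
```

Both encoded sets end up with the same six columns. Calling pandas.get_dummies separately on each set would instead derive the columns from whichever values that set happens to contain.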
For your information, this is going to be available in sklearn pretty soon.
See https://github.com/scikit-learn/scikit-learn/pull/9151
In [30]: cat = CategoricalEncoder()
In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1., 0., 0., 1., 0.],
       [ 0., 1., 0., 0., 1.],
       [ 1., 0., 0., 1., 0.],
       [ 0., 0., 1., 0., 1.]])
If you install the master branch you should be able to do it.
Another way to do it is to use category_encoders.
Here is an example:
% pip install category_encoders
import numpy as np
import category_encoders as ce
le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)
array([[1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1]])
Edit: It's already on the master branch.
Edit 2: It is in sklearn==0.20.dev0
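As far as I can tell, in the released 0.20 version this functionality was folded into `OneHotEncoder` itself (with `OrdinalEncoder` for integer codes) rather than shipping under the name `CategoricalEncoder`. Assuming scikit-learn 0.20 or later, a sketch for the data in the question would be:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['a', 'dog', 'red'],
                    ['b', 'cat', 'green']])

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(X_train).toarray()

# Reuse the fitted encoder on new data; 'blue' was not in the training set
X_test = np.array([['b', 'dog', 'blue']])
test_encoded = encoder.transform(X_test).toarray()

print(train_encoded)
print(test_encoded)
```

The fitted encoder keeps the per-column categories in `encoder.categories_`, so there is no need to store one encoder per column by hand.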
Very nice question.
However, in some sense, it is a special case of something that comes up (at least for me) rather often: given sklearn stages that apply to subsets of the columns of the X matrix, I'd like to apply one (or possibly several) of them given the entire matrix. Here, for example, you have a stage that knows how to run on a single column, and you'd like to apply it three times, once per column.
This is a classic case for using the Composite Design Pattern.
Here is a (sketch of a) reusable stage that accepts a dictionary mapping a column index to the transformation to apply to it:
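class ColumnApplier(object):
    def __init__(self, column_stages):
        # Maps column index -> transformer to apply to that column
        self._column_stages = column_stages

    def fit(self, X, y=None):
        # Fit each transformer on its own column only
        for i, k in self._column_stages.items():
            k.fit(X[:, i])
        return self

    def transform(self, X):
        # Work on a copy so the caller's array is left untouched
        X = X.copy()
        for i, k in self._column_stages.items():
            X[:, i] = k.transform(X[:, i])
        return X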
Now, to use it in this context, starting with
import numpy as np
from sklearn import preprocessing
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
y = np.array([1, 2])
X
you would just use it to map each column index to the transformation you want:
multi_encoder = \
    ColumnApplier(dict([(i, preprocessing.LabelEncoder()) for i in range(3)]))
multi_encoder.fit(X, None).transform(X)
Once you develop such a stage (I can't post the one I use), you can use it over and over for various settings.
I've faced this problem many times, and I found a solution in this book, on page 100:
We can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:
The sample code is:
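from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
# LabelBinarizer works on one column of labels at a time, so pass a
# single column (here the second one) rather than the whole 2-D table
housing_cat_1hot = encoder.fit_transform([row[1] for row in data])
housing_cat_1hot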
and as a result :
Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
You can find more about LabelBinarizer in the official sklearn documentation.
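To show what the output looks like, here is a small self-contained sketch on one column of labels. The `animals` list is made up for illustration ('bird' is an extra value that is not in the question's data), so that there are three categories.

```python
from sklearn.preprocessing import LabelBinarizer

# A single categorical column; 'bird' is an extra value added for illustration
animals = ['dog', 'cat', 'bird']

encoder = LabelBinarizer()
animals_1hot = encoder.fit_transform(animals)

# classes_ holds the sorted categories, one per output column
print(encoder.classes_)
print(animals_1hot)
```

Keep in mind that with only two distinct values, LabelBinarizer returns a single 0/1 column rather than two columns.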