I'm trying to perform a one hot encoding of a trivial dataset.
data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]
What's the best way to preprocess this data using Scikit-Learn?
On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But the one hot encoder doesn't support strings as features; it only discretizes integers.
So then you would use a LabelEncoder, which encodes the strings into integers. But then you have to apply a label encoder to each of the columns and store every one of these label encoders (as well as the columns they were applied to). This feels extremely clunky.
So, what's the best way to do it in Scikit-Learn?
Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one hot encodings. However, it's limited in that you can't encode your training and test sets separately.
For your information, this is going to be available in sklearn pretty soon; see https://github.com/scikit-learn/scikit-learn/pull/9151
If you install the master branch you should be able to do it.
Another way to do it is to use category_encoders.
Here is an example:
Edit: It's already on the master branch.
Edit 2: It is in sklearn==0.20.dev0
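As a rough illustration of what the newer encoder allows, here is a minimal sketch, assuming scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly. The `test` row and its unseen value 'blue' are made up for the example; they are not part of the question's data.

```python
from sklearn.preprocessing import OneHotEncoder

train = [['a', 'dog', 'red'],
         ['b', 'cat', 'green']]
# Hypothetical test row; 'blue' never appears in the training data.
test = [['a', 'cat', 'blue']]

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown='ignore')

# The encoder returns a sparse matrix by default, so convert for display.
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_)
print(train_encoded)
print(test_encoded)
```

Because the encoder is fitted once on the training set and then reused, the test set gets the same columns in the same order, which is the part pandas.get_dummies does not give you.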
I've faced this problem many times, and I found a solution in this book, on page 100:
and the sample code is here:
and as a result:
Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
You can find more about LabelBinarizer in the official sklearn documentation.
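The dense versus sparse behaviour described above can be sketched as follows. This is not the book's listing; the `animals` list is an assumed toy column chosen to have three distinct labels.

```python
from sklearn.preprocessing import LabelBinarizer

# Assumed toy column with three distinct labels.
animals = ['dog', 'cat', 'dog', 'bird']

# Default: a dense NumPy array with one column per class.
encoder = LabelBinarizer()
dense = encoder.fit_transform(animals)

# classes_ gives the column order (labels are sorted).
print(encoder.classes_)
print(dense)

# sparse_output=True returns a SciPy sparse matrix instead.
sparse_encoder = LabelBinarizer(sparse_output=True)
sparse = sparse_encoder.fit_transform(animals)
print(sparse.toarray())
```

Keep in mind that LabelBinarizer handles a single column of labels at a time, and that with only two distinct labels it produces one 0/1 column rather than two.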
Very nice question.
However, in some sense, it is a special case of something that comes up (at least for me) rather often: given sklearn stages applicable to subsets of the X matrix, I'd like to apply (possibly several of) them given the entire matrix. Here, for example, you have a stage which knows how to run on a single column, and you'd like to apply it thrice, once per column.
This is a classic case for using the Composite Design Pattern.
Here is a (sketch of a) reusable stage that accepts a dictionary mapping a column index to the transformation to apply to it:
Now, to use it in this context, starting with
you would just use it to map each column index to the transformation you want:
Once you develop such a stage (I can't post the one I use), you can use it over and over for various settings.
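For illustration, a minimal sketch of what such a stage might look like. The class name `ColumnApplier` and its `column_stages` parameter are hypothetical, and the data is the question's example extended with an assumed third row so that every column has three distinct values.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer


class ColumnApplier(TransformerMixin, BaseEstimator):
    """Hypothetical composite stage: one transformer per column index."""

    def __init__(self, column_stages):
        # column_stages: dict mapping a column index to a transformer
        self.column_stages = column_stages

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Fit each transformer on its own column only.
        for index, stage in self.column_stages.items():
            stage.fit(X[:, index])
        return self

    def transform(self, X):
        X = np.asarray(X)
        blocks = []
        # Apply the stages in column order and stack the outputs side by side.
        for index, stage in sorted(self.column_stages.items()):
            out = np.asarray(stage.transform(X[:, index]))
            blocks.append(out.reshape(len(X), -1))
        return np.hstack(blocks)


# The question's data plus an assumed third row.
data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green'],
        ['c', 'bird', 'blue']]

stage = ColumnApplier({0: LabelBinarizer(),
                       1: LabelBinarizer(),
                       2: LabelBinarizer()})
encoded = stage.fit_transform(data)
print(encoded)

# The fitted stage can be reused on new rows, e.g. a test set.
new_encoded = stage.transform([['b', 'dog', 'blue']])
print(new_encoded)
```

The fitted transformers live inside the composite, so there is nothing extra to store per column. Since version 0.20, scikit-learn also ships `sklearn.compose.ColumnTransformer`, which covers the same use case.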