I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder object that works across all my columns of data.
Throwing the entire DataFrame into LabelEncoder creates the below error. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York']
})
le = preprocessing.LabelEncoder()
le.fit(df)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)
Any thoughts on how to get around this problem?
Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer and sklearn.preprocessing.OneHotEncoder. If you only have categorical variables, apply OneHotEncoder directly; if you have heterogeneously typed features, wrap it in a ColumnTransformer. Sketches of both follow.
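A minimal sketch for the all-categorical case, reusing the df from the question (the handle_unknown option is my own addition, not part of the original answer):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
X = ohe.fit_transform(df)   # sparse matrix with one indicator column per category
ohe.categories_             # the categories learned for each input column

For heterogeneously typed features, one possible approach (selecting the string columns with select_dtypes is an assumption on my part):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical_cols = list(df.select_dtypes(include='object').columns)
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough')   # leave any numeric columns untouched
X = ct.fit_transform(df)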
More options in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
I checked the source code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py) of LabelEncoder. It is based on a set of numpy transformations, one of which is np.unique(), and that function only takes 1-d array input (correct me if I am wrong).
Very rough idea... first, identify which columns need a LabelEncoder, then loop through each column, along the lines of the sketch below.
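A rough sketch of that loop (the function name and the dict shape of label_list are my own choices; the idea is simply one LabelEncoder per string column):

from sklearn import preprocessing

def encode_labels(df):
    label_list = {}
    for col in df.columns:
        if df[col].dtype == object:            # only the string-labeled columns
            le = preprocessing.LabelEncoder()
            df[col] = le.fit_transform(df[col])
            # map each integer code back to the original label
            label_list[col] = dict(zip(le.transform(le.classes_), le.classes_))
    return df, label_list

df, label_list = encode_labels(df)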
The returned df would be the one after encoding, and label_list will show you what all those values mean in the corresponding column. This is a snippet from a data processing script I wrote for work. Let me know if you think there could be any further improvement.
EDIT: Just want to mention here that the methods above work best with data frames that have no missing values. Not sure how they behave on data frames containing missing data. (I dealt with missing values before executing the methods above.)
As mentioned by larsmans, LabelEncoder() only takes a 1-d array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing, and returns a transformed dataframe. My code here is based in part on Zac Stewart's excellent blog post found here.
Creating a custom encoder involves simply creating a class that responds to the fit(), transform(), and fit_transform() methods. In your case, a good start might be something like this:
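A minimal sketch of such a class, assuming a columns argument that lists which columns to encode (and encoding everything when it is left as None):

from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns   # list of column names to encode

    def fit(self, X, y=None):
        return self              # nothing to precompute in this simple sketch

    def transform(self, X):
        output = X.copy()
        cols = self.columns if self.columns is not None else output.columns
        for col in cols:
            output[col] = LabelEncoder().fit_transform(output[col])
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)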
Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows:
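For example, with a hypothetical fruit_data frame (the column names fruit, color, and weight come from the text above; the values are made up for illustration):

import pandas

fruit_data = pandas.DataFrame({
    'fruit':  ['apple', 'orange', 'pear', 'orange'],
    'color':  ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4],
})

# encode the two categorical attributes, leave the numeric one alone
MultiColumnLabelEncoder(columns=['fruit', 'color']).fit_transform(fruit_data)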
This replaces every string in the fruit and color columns of the fruit_data dataset with an integer code, while leaving the numeric weight column untouched.
Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded (which I believe is what you were originally looking for); applied to the df from the question, every string label is transformed to an integer code.
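For instance, with the df from the question:

MultiColumnLabelEncoder().fit_transform(df)   # every column gets integer codes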
Note that it'll probably choke when it tries to encode attributes that are already numeric (add some code to handle this if you like).
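One simple guard (my own suggestion, not part of the original answer) is to compute the columns argument from the dtypes so numeric columns are never handed to the encoder:

import pandas.api.types as ptypes

# only pass string/object columns to the encoder; numeric ones go through untouched
cols_to_encode = [c for c in fruit_data.columns
                  if not ptypes.is_numeric_dtype(fruit_data[c])]
MultiColumnLabelEncoder(columns=cols_to_encode).fit_transform(fruit_data)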
Another nice feature about this is that we can use this custom transformer in a pipeline:
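A sketch of such a pipeline, reusing the MultiColumnLabelEncoder and fruit_data from above (the scaler step is just a placeholder second stage):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('encode', MultiColumnLabelEncoder(columns=['fruit', 'color'])),
    ('scale', StandardScaler()),
])
pipe.fit_transform(fruit_data)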
Following up on the comments raised on the solution of @PriceHardman, I would propose the following version of the class:
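A sketch along those lines (the constructor mirrors the one above; storing one fitted LabelEncoder per column is an assumption about the original code):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        cols = self.columns if self.columns is not None else X.columns
        # fit one LabelEncoder per column on the training data only
        self.encoders_ = {col: LabelEncoder().fit(X[col]) for col in cols}
        return self

    def transform(self, X):
        output = X.copy()
        for col, le in self.encoders_.items():
            output[col] = le.transform(output[col])
        return output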
This class fits the encoder on the training set and uses the fitted version when transforming. Initial version of the code can be found here.
This is a year-and-a-half after the fact, but I too needed to be able to .transform() multiple pandas dataframe columns at once (and be able to .inverse_transform() them as well). This expands upon the excellent suggestion of @PriceHardman above:
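A sketch of such a class; the attributes all_classes_, all_encoders_, and all_labels_ match the names used at the end of this answer, while the rest of the implementation is an assumption:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        cols = self.columns if self.columns is not None else X.columns
        self.all_encoders_ = {}
        self.all_classes_ = {}
        self.all_labels_ = {}
        for col in cols:
            le = LabelEncoder().fit(X[col])
            self.all_encoders_[col] = le
            self.all_classes_[col] = le.classes_
            self.all_labels_[col] = le.transform(le.classes_)
        return self

    def transform(self, X):
        output = X.copy()
        for col, le in self.all_encoders_.items():
            output[col] = le.transform(output[col])
        return output

    def inverse_transform(self, X):
        output = X.copy()
        for col, le in self.all_encoders_.items():
            output[col] = le.inverse_transform(output[col])
        return output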
Example: If df and df_copy are mixed-type pandas dataframes, you can apply the MultiColumnLabelEncoder() to the dtype=object columns in the following way:
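A usage sketch (df_copy is assumed to be a plain copy of df kept for comparison):

df_copy = df.copy()

object_cols = list(df_copy.select_dtypes(include='object').columns)
mcle = MultiColumnLabelEncoder(columns=object_cols)

df_encoded = mcle.fit_transform(df_copy)           # string columns -> integer codes
df_restored = mcle.inverse_transform(df_encoded)   # round-trip back to the labels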
You can access individual column classes, column labels, and column encoders used to fit each column via indexing:

mcle.all_classes_
mcle.all_encoders_
mcle.all_labels_