I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder object that works across all my columns of data.
Throwing the entire DataFrame into LabelEncoder creates the error below. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.
import pandas
from sklearn import preprocessing

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York']
})

le = preprocessing.LabelEncoder()
le.fit(df)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)
Any thoughts on how to get around this problem?
No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It's designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).
If we have a single column, label encoding and its inverse transform are easy. How do you do it in Python when there are multiple columns?
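For reference, the single-column case that LabelEncoder is designed for can be sketched as follows. The `pets` values are taken from the question; the variable names are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One column of string labels (the 'pets' column from the question).
pets = pd.Series(['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'])

le = LabelEncoder()
codes = le.fit_transform(pets)          # 1-d in, 1-d out: [0, 1, 0, 2, 1, 1]
restored = le.inverse_transform(codes)  # back to the original strings

print(list(le.classes_))  # ['cat', 'dog', 'monkey']
```

The difficulty with many columns is that one fitted encoder only remembers one set of classes, so each column needs either its own encoder or a shared label space.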
If your dataframe has both numerical and categorical data, you can use the following (here X is my dataframe, with both categorical and numerical variables):
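A minimal sketch of this idea, assuming the intent is to reuse one LabelEncoder and overwrite only the non-numeric columns of X in place. The sample X below is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for X: two categorical columns and one numerical column.
X = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ'],
    'age': [3, 5, 2, 7],
})

le = LabelEncoder()
for col in X.columns:
    # Leave numeric columns untouched; encode everything else.
    if not pd.api.types.is_numeric_dtype(X[col]):
        # The same encoder is refit on each column, so the classes of
        # earlier columns are overwritten and cannot be inverted later.
        X[col] = le.fit_transform(X[col])
```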
Note: This technique is good if you are not interested in converting them back.
You can easily do this, though:
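One way to sketch this, assuming the goal is simply to encode every column without naming any of them, is to pass `fit_transform` to `DataFrame.apply`, which calls it once per column. The data is the dummy data from the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# apply() hands each column to fit_transform as a 1-d Series,
# which is the shape LabelEncoder expects.
df_encoded = df.apply(LabelEncoder().fit_transform)
```

Note that the single encoder is refit on every column, so codes are only meaningful within a column and nothing is kept for decoding afterwards.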
EDIT:
Since this answer is over a year old and has generated many upvotes (including a bounty), I should probably extend it further.
For inverse_transform and transform, you have to use a little bit of a hack.
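A sketch of one such hack, assuming a `defaultdict` keyed by column name is acceptable: each column lazily gets its own LabelEncoder the first time it is looked up. The name `encoders` is illustrative.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# A new LabelEncoder is created on first access to each column name.
encoders = defaultdict(LabelEncoder)

# Fit and encode every column with its own encoder.
encoded = df.apply(lambda col: encoders[col.name].fit_transform(col))

# Decode back to the original labels.
decoded = encoded.apply(lambda col: encoders[col.name].inverse_transform(col))

# Encode data again later with the already fitted encoders.
reencoded = df.apply(lambda col: encoders[col.name].transform(col))
```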
With this, you now retain every column's LabelEncoder in a dictionary.
After a lot of searching and experimenting with some answers here and elsewhere, I think your answer is here:
This will preserve category names across columns:
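A sketch of how a single encoder can share one label space across all columns, assuming the approach is to flatten the whole frame to 1-d, encode it, and reshape the result. The same string then gets the same code in whichever column it appears.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

le = LabelEncoder()

# Flatten to 1-d so LabelEncoder accepts it, then restore the 2-d shape.
flat_codes = le.fit_transform(df.to_numpy().ravel())
encoded = pd.DataFrame(flat_codes.reshape(df.shape),
                       columns=df.columns, index=df.index)

# Because there is only one encoder, decoding works the same way.
decoded = le.inverse_transform(encoded.to_numpy().ravel()).reshape(df.shape)
```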
I mainly used @Alexander's answer but had to make some changes:
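The code for this answer is not shown here, so the following is only a sketch of the general idea it describes, assuming a plain dictionary of label-to-code mappings per column that is applied with `.map()`. The name `mapper` and the use of sorted unique labels are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# One {label: code} dictionary per column, built from the sorted unique labels.
mapper = {
    col: {label: code for code, label in enumerate(sorted(df[col].unique()))}
    for col in df.columns
}

# Apply each column's mapping with Series.map().
encoded = df.copy()
for col in df.columns:
    encoded[col] = df[col].map(mapper[col])
```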
Then, to re-use it in the future, you can just save the output to a JSON document; when you need it, read it back in and use the .map() function like I did above.
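A sketch of that save-and-reload step, assuming the saved output is a nested dictionary of label-to-code mappings per column. The file name `label_mapper.json` and the sample frames are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog']})

# Per-column {label: code} mappings (assumed structure of the saved output).
mapper = {
    col: {label: code for code, label in enumerate(sorted(df[col].unique()))}
    for col in df.columns
}

# Save the mappings; the path is a throwaway location for this example.
path = os.path.join(tempfile.mkdtemp(), 'label_mapper.json')
with open(path, 'w') as f:
    json.dump(mapper, f)

# Later: read the mappings back and encode new data with .map().
with open(path) as f:
    loaded = json.load(f)

new_df = pd.DataFrame({'pets': ['dog', 'monkey', 'cat']})
new_encoded = new_df.copy()
for col in new_df.columns:
    new_encoded[col] = new_df[col].map(loaded[col])
```

Labels that are missing from the saved mapping come out of `.map()` as NaN, so unseen categories need separate handling.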