I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder object that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder creates the error below. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York']
})
le = preprocessing.LabelEncoder()
le.fit(df)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)
Any thoughts on how to get around this problem?
Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is fit it on df.columns. After that you will have a unique number corresponding to each column. More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.values). To get a column's encoding, simply pass it to le.transform(...). As an example, le.transform(df.columns.values) will get the encoding for each column.

Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels, you can fit it on the flattened values of the DataFrame. In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created you can do le.classes_. You'll note that this should have the same elements as in set(y for x in df.values for y in x). Once again, to convert a row label to an encoded label use le.transform(...). As an example, to retrieve the label for the first column in the df.columns array and the first row, you could pass that single value, wrapped in a list, to le.transform(...).

The question you had in your comment is a bit more complicated, but can still be accomplished:
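The above code does the following:

1. Makes a unique combination of all of the (column, row) pairs.
2. Represents each pair as the string version of the tuple. This works around the LabelEncoder class not supporting tuples as a class name.
3. Fits the new items to the LabelEncoder.

Now to use this new model it's a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we build the string form of that (column, row) tuple and pass it to le.transform(...).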
Remember that each lookup is now a string representation of a tuple that contains the (column, row).
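Returning to the simpler case earlier in this answer (one encoder shared by every value in the frame), a minimal sketch might look like the following. It assumes a recent pandas, so it uses .to_numpy() where older code would call get_values(); the name encoded is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York'],
})

# Flatten the whole frame to 1-D so that fit() accepts it.
le = LabelEncoder()
le.fit(df.to_numpy().ravel())

# Encode column by column with the single shared encoder,
# without referring to any column by name.
encoded = pd.DataFrame(
    {col: le.transform(df[col]) for col in df.columns},
    index=df.index,
)

# Decode one column again to confirm the mapping is reversible.
decoded_pets = le.inverse_transform(encoded['pets'])
```

Because the encoder is shared, a given string gets the same code in every column, which is different from the (column, row) tuple variant above.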
We don't need a LabelEncoder.
You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.
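A minimal sketch of the categorical approach described above, assuming the sample DataFrame from the question; the name encoded is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York'],
})

# Convert each column to a categorical and take its integer codes,
# then wrap the result in a frame with the same index and column names.
encoded = pd.DataFrame(
    {col: df[col].astype('category').cat.codes for col in df.columns},
    index=df.index,
)
```

Each column is coded independently, with categories numbered in sorted order, so 'cat' becomes 0 in pets and 'Brick' becomes 0 in owner.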
To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:
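As a sketch of that mapping dictionary, again assuming the question's sample data (the name mapping is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York'],
})

# For every column, map each integer code back to its category label.
mapping = {
    col: dict(enumerate(df[col].astype('category').cat.categories))
    for col in df.columns
}
```

With this, mapping['pets'][2] gives back 'monkey'.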
The problem is the shape of the data (a pandas DataFrame) you are passing to the fit function. You have to pass a 1-D list or array.
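A small sketch of the difference, assuming the question's sample data: the 2-D frame is rejected, while a single column is accepted. The flag fit_2d_failed is illustrative, and the exact error message varies between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York'],
})

le = LabelEncoder()

# Passing the whole (6, 3) frame fails because fit() expects 1-D input.
try:
    le.fit(df)
    fit_2d_failed = False
except ValueError:
    fit_2d_failed = True

# A single column is 1-D, so this works.
pet_codes = le.fit_transform(df['pets'])
```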
It is possible to do this all in pandas directly, and the task is well suited to a unique ability of the replace method.

First, let's make a dictionary of dictionaries mapping the columns and their values to their new replacement values.
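One way such a dictionary of dictionaries could be built is sketched here, assuming the question's sample data and using pd.Categorical to enumerate each column's distinct values; the name transform_dict is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York'],
})

# Outer keys are column names; inner dicts map each label to an integer.
transform_dict = {
    col: {cat: code for code, cat in enumerate(pd.Categorical(df[col]).categories)}
    for col in df.columns
}
```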
Since this will always be a one to one mapping, we can invert the inner dictionary to get a mapping of the new values back to the original.
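A sketch of the inversion and of the round trip through replace, under the assumption that the forward mapping is built as a per-column dict of label to integer. The names transform_dict and inverse_transform_dict are illustrative, and the cast to object is a defensive assumption so the mixed string/integer replacement does not depend on which pandas string dtype is in use.

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York'],
})

# Forward mapping: column -> {label: integer code}.
transform_dict = {
    col: {cat: code for code, cat in enumerate(pd.Categorical(df[col]).categories)}
    for col in df.columns
}

# Invert each inner dict: column -> {integer code: label}.
inverse_transform_dict = {
    col: {code: cat for cat, code in mapping.items()}
    for col, mapping in transform_dict.items()
}

# replace() uses the outer keys as columns and the inner keys as values to replace.
encoded = df.astype(object).replace(transform_dict)

# Chaining replace() with the inverse mapping restores the original labels.
restored = encoded.replace(inverse_transform_dict)
```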
Now, we can use the unique ability of the replace method to take a nested dictionary and use the outer keys as the columns, and the inner keys as the values we would like to replace.

We can easily go back to the original by again chaining the replace method.

A short way to LabelEncoder() multiple columns is with a dict(), here called le_dict, which you can then use to label-encode any other column.

This does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies).
However, for the purpose of a few classification tasks etc., you could use one-hot encoding, which takes a dataframe with categorical data and returns a dataframe with binary values. Variable values are encoded into column names in the resulting dataframe.
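The function name did not survive in the text above. The description matches pandas.get_dummies, so the sketch below assumes that is what was meant; the name dummies is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York'],
})

# One indicator column per (column, value) pair, named like 'pets_cat'.
dummies = pd.get_dummies(df)
```

Note that this yields one column per distinct value rather than one integer per label, so with 50 columns of string data the result can become quite wide.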