Label encoding across multiple columns in scikit-l

2019-01-01 14:31发布

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder objects that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder creates the below error. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string labeled data, so need a solution that doesn't reference any columns by name.

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

le = preprocessing.LabelEncoder()

le.fit(df)

Traceback (most recent call last): File "", line 1, in File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit y = column_or_1d(y, warn=True) File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d raise ValueError("bad input shape {0}".format(shape)) ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?

17条回答
宁负流年不负卿
2楼-- · 2019-01-01 14:32

Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:

le.fit(df.columns)

In the above code you will have a unique number corresponding to each column. More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.get_values()). To get a column's encoding, simply pass it to le.transform(...). As an example, the following will get the encoding for each column:

le.transform(df.columns.get_values())

Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels you can do the following:

le.fit([y for x in df.get_values() for y in x])

In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created you can do le.classes_. You'll note that this should have the same elements as in set(y for x in df.get_values() for y in x). Once again to convert a row label to an encoded label use le.transform(...). As an example, if you want to retrieve the label for the first column in the df.columns array and the first row, you could do this:

le.transform([df.get_value(0, df.columns[0])])

The question you had in your comment is a bit more complicated, but can still be accomplished:

le.fit([str(z) for z in set((x[0], y) for x in df.iteritems() for y in x[1])])

The above code does the following:

  1. Make a unique combination of all of the pairs of (column, row)
  2. Represent each pair as a string version of the tuple. This is a workaround to overcome the LabelEncoder class not supporting tuples as a class name.
  3. Fits the new items to the LabelEncoder.

Now to use this new model it's a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:

le.transform([str((df.columns[0], df.get_value(0, df.columns[0])))])

Remember that each lookup is now a string representation of a tuple that contains the (column, row).

查看更多
人气声优
3楼-- · 2019-01-01 14:35

We don't need a LabelEncoder.

You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.

>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:

>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)} 
     for col in df}

{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
查看更多
旧时光的记忆
4楼-- · 2019-01-01 14:36

The problem is the shape of the data (pd dataframe) you are passing to the fit function. You've got to pass 1d list.

查看更多
不再属于我。
5楼-- · 2019-01-01 14:38

It is possible to do this all in pandas directly and is well-suited for a unique ability of the replace method.

First, let's make a dictionary of dictionaries mapping the columns and their values to their new replacement values.

transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d

transform_dict
{'location': {'New_York': 0, 'San_Diego': 1},
 'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
 'pets': {'cat': 0, 'dog': 1, 'monkey': 2}}

Since this will always be a one to one mapping, we can invert the inner dictionary to get a mapping of the new values back to the original.

inverse_transform_dict = {}
for col, d in transform_dict.items():
    inverse_transform_dict[col] = {v:k for k, v in d.items()}

inverse_transform_dict
{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}

Now, we can use the unique ability of the replace method to take a nested list of dictionaries and use the outer keys as the columns, and the inner keys as the values we would like to replace.

df.replace(transform_dict)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

We can easily go back to the original by again chaining the replace method

df.replace(transform_dict).replace(inverse_transform_dict)
    location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego     Champ  monkey
4  San_Diego  Veronica     dog
5   New_York       Ron     dog
查看更多
何处买醉
6楼-- · 2019-01-01 14:40

A short way to LabelEncoder() multiple columns with a dict():

from sklearn.preprocessing import LabelEncoder
le_dict = {col: LabelEncoder() for col in columns }
for col in columns:
    le_dict[col].fit_transform(df[col])

and you can use this le_dict to labelEncode any other column:

le_dict[col].transform(df_another[col])
查看更多
春风洒进眼中
7楼-- · 2019-01-01 14:41

this does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies)

However, for the purpose of a few classification tasks etc. you could use

pandas.get_dummies(input_df) 

this can input dataframe with categorical data and return a dataframe with binary values. variable values are encoded into column names in the resulting dataframe. more

查看更多
登录 后发表回答