Label encoding across multiple columns in scikit-l

2019-01-01 14:31发布

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder objects that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder creates the below error. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string labeled data, so need a solution that doesn't reference any columns by name.

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

le = preprocessing.LabelEncoder()

le.fit(df)

Traceback (most recent call last): File "", line 1, in File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit y = column_or_1d(y, warn=True) File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d raise ValueError("bad input shape {0}".format(shape)) ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?

17条回答
倾城一夜雪
2楼-- · 2019-01-01 14:41

No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It's designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).

查看更多
萌妹纸的霸气范
3楼-- · 2019-01-01 14:44

if we have single column to do the label encoding and its inverse transform its easy how to do it when there are multiple columns in python

def stringtocategory(dataset):
    '''
    @author puja.sharma
    @see The function label encodes the object type columns and gives label      encoded and inverse tranform of the label encoded data
    @param dataset dataframe on whoes column the label encoding has to be done
    @return label encoded and inverse tranform of the label encoded data.
   ''' 
   data_original = dataset[:]
   data_tranformed = dataset[:]
   for y in dataset.columns:
       #check the dtype of the column object type contains strings or chars
       if (dataset[y].dtype == object):
          print("The string type features are  : " + y)
          le = preprocessing.LabelEncoder()
          le.fit(dataset[y].unique())
          #label encoded data
          data_tranformed[y] = le.transform(dataset[y])
          #inverse label transform  data
          data_original[y] = le.inverse_transform(data_tranformed[y])
   return data_tranformed,data_original
查看更多
倾城一夜雪
4楼-- · 2019-01-01 14:45

If you have numerical and categorical both type of data in dataframe You can use : here X is my dataframe having categorical and numerical both variables

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

for i in range(0,X.shape[1]):
    if X.dtypes[i]=='object':
        X[X.columns[i]] = le.fit_transform(X[X.columns[i]])

Note: This technique is good if you are not interested in converting them back.

查看更多
与风俱净
5楼-- · 2019-01-01 14:47

You can easily do this though,

df.apply(LabelEncoder().fit_transform)

EDIT:

Since this answer is over a year ago, and generated many upvotes (including a bounty), I should probably extend this further.

For inverse_transform and transform, you have to do a little bit of hack.

from collections import defaultdict
d = defaultdict(LabelEncoder)

With this, you now retain all columns LabelEncoder as dictionary.

# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))
查看更多
骚的不知所云
6楼-- · 2019-01-01 14:48

After lots of search and experimentation with some answers here and elsewhere, I think your answer is here:

pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

This will preserve category names across columns:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame([['A','B','C','D','E','F','G','I','K','H'],
                   ['A','E','H','F','G','I','K','','',''],
                   ['A','C','I','F','H','G','','','','']], 
                  columns=['A1', 'A2', 'A3','A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'])

pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

    A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
0   1   2   3   4   5   6   7   9   10  8
1   1   5   8   6   7   9   10  0   0   0
2   1   3   9   6   8   7   0   0   0   0
查看更多
谁念西风独自凉
7楼-- · 2019-01-01 14:49

Mainly used @Alexander answer but had to make some changes -

cols_need_mapped = ['col1', 'col2']

mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)} 
     for col in df[cols_need_mapped]}

for c in cols_need_mapped :
    df[c] = df[c].map(mapper[c])

Then to re-use in the future you can just save the output to a json document and when you need it you read it in and use the .map() function like I did above.

查看更多
登录 后发表回答