Given a pandas DataFrame that looks like this:
| | c_0337 | c_0348 | c_0351 | c_0364 |
|-------|:------:|-------:|--------|--------|
| id | | | | |
| 11193 | a | f | o | a |
| 11382 | a | k | s | a |
| 16531 | b | p | f | b |
| 1896 | a | f | o | NaN |
I am trying to convert the categorical variables to numeric ones (preferably binary true/false columns). I tried using the OneHotEncoder from scikit-learn as follows:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([c4k.ix[:,'c_0327':'c_0351'].values])
OneHotEncoder(categorical_features='all',
n_values='auto', sparse=True)
That just gave me: invalid literal for long() with base 10: 'f'
I need to get the data into an array acceptable to scikit-learn, with one new column created for each letter in each original column: true in the column that matches the row's letter and false everywhere else (i.e. very sparse), with NaN treated as 0 (false).
I suspect I'm way off here? Like not even using the right preprocessor?
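For illustration, the target layout described above can be sketched with `pandas.get_dummies`. This is only one possible way to get there, not necessarily the right preprocessor for the real data; the frame `df` below is a hypothetical toy copy of the sample table, not the actual `c4k` dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring the sample table (not the real c4k data)
df = pd.DataFrame(
    {
        "c_0337": ["a", "a", "b", "a"],
        "c_0348": ["f", "k", "p", "f"],
        "c_0351": ["o", "s", "f", "o"],
        "c_0364": ["a", "a", "b", np.nan],
    },
    index=pd.Index([11193, 11382, 16531, 1896], name="id"),
)

# One 0/1 indicator column per (original column, letter) pair.
# A NaN cell produces no indicator, so every c_0364_* entry is 0 for id 1896.
dummies = pd.get_dummies(df, dtype=int)

# Plain NumPy array that scikit-learn estimators can consume
X = dummies.to_numpy()
print(dummies)
```

With over 1000 columns the dense result may get large; `get_dummies` also accepts `sparse=True` if memory becomes a concern.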
I'm brand new at this, so any pointers are appreciated. The actual dataset has over 1000 such columns. So then I tried using DictVectorizer as follows:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
# Fill the df with zeros since we don't want NaN
c4kNZ=c4k.ix[:,'c_0327':'c_0351'].fillna(0)
# Convert the DataFrame to a dict
c4kb=c4kNZ.to_dict()
sdata = vec.fit_transform(c4kb)
It gives me: float() argument must be a string or a number. I rechecked the dict and it looks OK to me, but I guess I have not formatted it correctly?
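One likely cause of the formatting problem is the shape of the dict. `DataFrame.to_dict()` with no arguments returns `{column: {row_id: value}}`, while `DictVectorizer` expects a list with one flat dict per row. A minimal sketch of that row-wise format follows, assuming a hypothetical two-column toy frame named `frame` instead of the real `c4k` data. It drops missing values rather than filling them with 0, which is an assumption about the desired NaN handling.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy frame; the real data would be the selected c4k columns
frame = pd.DataFrame(
    {"c_0348": ["f", "k", "p", "f"], "c_0364": ["a", "a", "b", np.nan]},
    index=[11193, 11382, 16531, 1896],
)

# orient="records" yields one {column: value} dict per row
records = frame.to_dict(orient="records")

# Remove missing values so a NaN cell activates no column at all
records = [{k: v for k, v in row.items() if pd.notna(v)} for row in records]

vec = DictVectorizer(sparse=True)
sdata = vec.fit_transform(records)  # SciPy sparse matrix, one column per "name=value"

print(vec.feature_names_)
print(sdata.toarray())
```

String values are expanded into `name=value` indicator features, so the row for id 1896 has a single active entry (`c_0348=f`).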