Machine learning with multiple feature types in Python

Posted 2019-06-08 20:44

Question:

I can do some simple machine learning with the scikit-learn and NLTK modules in Python, but I run into problems when training with multiple features that have different value types (number, list of strings, yes/no, etc.). In the data below, I have a word/phrase column from which I extract information to create the other columns (for example, the Length column is the character length of the word/phrase). The Label column is the label.

Word/phrase  Length  2-letter substring                                First letter  With space?  Label
take action  10      ['ta', 'ak', 'ke', 'ac', 'ct', 'ti', 'io', 'on']  t             Yes          A
sure         4       ['su', 'ur', 're']                                s             No           A
That wasn't  10      ['th', 'ha', 'at', 'wa', 'as', 'sn', 'nt']        t             Yes          B
simply       6       ['si', 'im', 'mp', 'pl', 'ly']                    s             No           C
a lot of     6       ['lo', 'ot', 'of']                                a             Yes          D
said         4       ['sa', 'ai', 'id']                                s             No           B

Should I combine them into one dictionary and use sklearn's DictVectorizer to hold them in working memory, and then treat these features as one X vector when training the ML algorithms?

Answer 1:

The majority of machine learning algorithms work with numbers, so you need to transform your categorical values and strings into numbers.
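For the DictVectorizer idea from the question, something along these lines should work. This is only a minimal sketch: the helper `to_feature_dict` and the key names (`length`, `first_letter`, `with_space`, `bigram_xx`) are my own assumptions for illustration, and only three rows of the table are used.

```python
from sklearn.feature_extraction import DictVectorizer

def to_feature_dict(length, bigrams, first_letter, with_space):
    """Hypothetical helper: turn one table row into a flat feature dict."""
    features = {
        "length": length,                                # numeric, passed through as-is
        "first_letter": first_letter,                    # string, one-hot encoded by DictVectorizer
        "with_space": 1 if with_space == "Yes" else 0,   # yes/no mapped to 1/0
    }
    # Each 2-letter substring becomes its own binary feature.
    for bigram in bigrams:
        features["bigram_" + bigram] = 1
    return features

# Three rows copied from the table in the question.
rows = [
    (10, ["ta", "ak", "ke", "ac", "ct", "ti", "io", "on"], "t", "Yes"),
    (4, ["su", "ur", "re"], "s", "No"),
    (6, ["lo", "ot", "of"], "a", "Yes"),
]
labels = ["A", "A", "D"]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform([to_feature_dict(*row) for row in rows])

# vec.feature_names_ lists the generated columns, e.g. "first_letter=t".
print(X.shape)
```

The resulting X is a single numeric matrix with one row per word/phrase, so it can be passed together with `labels` to the `fit` method of a scikit-learn classifier.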

The popular Python machine-learning library scikit-learn has a whole chapter of its documentation dedicated to data preprocessing. With 'yes/no' everything is easy: just replace it with 0/1.

Among many other important things, that chapter explains how to preprocess categorical data using OneHotEncoder.
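As a rough illustration with the 'First letter' and 'With space?' columns from the question (the variable names are my own, and this is a sketch rather than the only way to do it):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Columns copied from the table in the question.
first_letter = np.array([["t"], ["s"], ["t"], ["s"], ["a"], ["s"]])
with_space = ["Yes", "No", "Yes", "No", "Yes", "No"]

# yes/no: simply map to 1/0.
with_space_01 = np.array([[1 if value == "Yes" else 0] for value in with_space])

# Categorical column: one binary column per distinct letter.
encoder = OneHotEncoder(handle_unknown="ignore")
first_letter_onehot = encoder.fit_transform(first_letter).toarray()

# Stack everything side by side into one numeric matrix.
X_cat = np.hstack([first_letter_onehot, with_space_01])

print(encoder.categories_)  # the letters, in column order
print(X_cat)
```

Setting `handle_unknown="ignore"` means a first letter that never appeared during fitting is encoded as all zeros at prediction time instead of raising an error.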

When you work with text, you also have to transform your data in a suitable way. One of the common feature extraction strategies for text is the tf-idf score, and I wrote a tutorial here.
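For the 2-letter substrings in the question, a tf-idf weighting could look roughly like this. The `word_bigrams` function is a hypothetical helper I wrote to reproduce the substring column from the table (lower-cased, letters only, no substrings across spaces); it is passed to TfidfVectorizer as a custom analyzer.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def word_bigrams(phrase):
    """Hypothetical helper: 2-letter substrings inside each word of the phrase."""
    grams = []
    for word in phrase.lower().split():
        letters = re.sub(r"[^a-z]", "", word)  # drop apostrophes and other non-letters
        grams.extend(letters[i:i + 2] for i in range(len(letters) - 1))
    return grams

# Word/phrase column copied from the question.
phrases = ["take action", "sure", "That wasn't", "simply", "a lot of", "said"]

# A callable analyzer replaces the default tokenisation entirely.
tfidf = TfidfVectorizer(analyzer=word_bigrams)
X_text = tfidf.fit_transform(phrases)  # sparse matrix, one row per phrase

print(word_bigrams("take action"))
print(X_text.shape)
```

X_text is a sparse matrix, so if you want to train on it together with the other numeric columns you can stack them column-wise with `scipy.sparse.hstack`.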