I can do some simple machine learning with the scikit-learn and NLTK modules in Python, but I run into problems when training with multiple features that have different value types (number, list of strings, yes/no, etc.). In the following data, I extract information from the word/phrase column to create the other feature columns (for example, the Length column is the character length of the word/phrase). The Label column is the target class.
Word/phrase   Length   2-letter substring                                  First letter   With space?   Label
take action   10       ['ta', 'ak', 'ke', 'ac', 'ct', 'ti', 'io', 'on']    t              Yes           A
sure          4        ['su', 'ur', 're']                                  s              No            A
That wasn't   10       ['th', 'ha', 'at', 'wa', 'as', 'sn', 'nt']          t              Yes           B
simply        6        ['si', 'im', 'mp', 'pl', 'ly']                      s              No            C
a lot of      6        ['lo', 'ot', 'of']                                  a              Yes           D
said          4        ['sa', 'ai', 'id']                                  s              No            B
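For reference, the feature columns could be derived roughly as in the sketch below. The helper name `extract_features` and the exact rules are assumptions inferred from the rows above: the length counts non-space characters, the 2-letter substrings are taken per word after lowercasing and dropping non-letters, and the yes/no column is stored as a boolean.

```python
def extract_features(phrase):
    """Hypothetical helper: derive the feature columns from one word/phrase."""
    words = phrase.lower().split()
    # Keep letters only, so "wasn't" becomes "wasnt" before taking bigrams.
    cleaned = ["".join(ch for ch in w if ch.isalpha()) for w in words]
    # 2-letter substrings are taken within each word, never across a space.
    bigrams = [w[i:i + 2] for w in cleaned for i in range(len(w) - 1)]
    return {
        "length": sum(len(w) for w in words),  # characters, spaces excluded
        "bigrams": bigrams,
        "first_letter": phrase[0].lower(),
        "with_space": " " in phrase,
    }


features = extract_features("take action")
```

This yields one dictionary per phrase, with a number, a list of strings, a single character, and a boolean side by side, which is exactly the mix of types the question is about.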
Should I combine them into one dictionary per sample and then use sklearn's DictVectorizer to hold them in memory? And then treat these features as a single X input when training the ML algorithms?
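To make the idea concrete, here is a minimal sketch of the dictionary approach using the six rows above. The `flatten` helper, the `bigram=` key prefix, and the choice of `DecisionTreeClassifier` are assumptions for illustration only; the list-valued feature is expanded into one count per 2-letter substring so that every dictionary value is either a number or a single string.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# The six rows from the data above, one dictionary per sample.
samples = [
    {"length": 10, "bigrams": ["ta", "ak", "ke", "ac", "ct", "ti", "io", "on"],
     "first_letter": "t", "with_space": "Yes"},
    {"length": 4, "bigrams": ["su", "ur", "re"],
     "first_letter": "s", "with_space": "No"},
    {"length": 10, "bigrams": ["th", "ha", "at", "wa", "as", "sn", "nt"],
     "first_letter": "t", "with_space": "Yes"},
    {"length": 6, "bigrams": ["si", "im", "mp", "pl", "ly"],
     "first_letter": "s", "with_space": "No"},
    {"length": 6, "bigrams": ["lo", "ot", "of"],
     "first_letter": "a", "with_space": "Yes"},
    {"length": 4, "bigrams": ["sa", "ai", "id"],
     "first_letter": "s", "with_space": "No"},
]
labels = ["A", "A", "B", "C", "D", "B"]


def flatten(sample):
    """Hypothetical helper: turn the list-valued feature into per-bigram counts."""
    flat = {
        "length": sample["length"],              # numeric: passed through as-is
        "first_letter": sample["first_letter"],  # string: one-hot encoded
        "with_space": sample["with_space"],      # string: one-hot encoded
    }
    for bigram in sample["bigrams"]:
        key = "bigram=" + bigram
        flat[key] = flat.get(key, 0) + 1
    return flat


vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([flatten(s) for s in samples])

# Any scikit-learn classifier accepts X; a decision tree is just an example.
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
predictions = clf.predict(X)
```

Here `X` is a single 2-D array with one row per word/phrase and one column per numeric feature, per distinct string value, and per 2-letter substring; `vectorizer.vocabulary_` maps each feature name to its column index.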