I'm learning Scikit-Learn to classify tweets. I have a CSV file with tweets in one column and their class (0-11) in the next column. I went through this tutorial from the Scikit-Learn site. I think I understand how the actual classification is done, but I don't think I really understood the data format. In the tutorial, the material was stored in files in folders, where the folder names acted as classification tags.
In my case I need to load the data from a CSV file, and apparently I need to manually construct the data structure that is fed to the vectorizer and classifier. How should I approach this? I think the tutorial was a bit ambiguous in this respect, since the data loading was done automagically, which left me in the dark about the structure and loading of custom data.
Normally you would use pandas.read_csv or, if you don't want a pandas dependency, numpy.genfromtxt, or even load the CSV into a list using the standard library. It would look like this:
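A minimal sketch with pandas, assuming the CSV has a header row with columns named tweet and label (both names are hypothetical; adjust them to your file). An in-memory string stands in for the real file so the snippet runs on its own:

```python
import io

import pandas as pd

# Stand-in for the real file. With a file on disk, pass its path
# to pd.read_csv instead of this in-memory buffer.
csv_text = io.StringIO(
    "tweet,label\n"
    '"I love this phone, really",3\n'
    "Worst service ever,7\n"
    "Traffic is terrible today,7\n"
)

df = pd.read_csv(csv_text)

# X is a plain list of raw tweet strings, y an array of integer class labels.
# The column names "tweet" and "label" are assumptions for this example.
X = df["tweet"].tolist()
y = df["label"].to_numpy()
```

If your file has no header row, passing header=None and names=["tweet", "label"] to pd.read_csv should give the same two columns.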
Once you have your X and y, you can feed them to a classifier.
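As a rough sketch of that last step, the example below chains a TfidfVectorizer and a MultinomialNB classifier in a pipeline. Both choices are assumptions made for illustration, not something the question prescribes, and the toy X and y are hypothetical stand-ins for the data loaded from your CSV:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: X is a list of tweet strings, y the matching labels.
X = [
    "I love this phone, really",
    "Best camera I have ever used",
    "Worst service ever",
    "Traffic is terrible today",
]
y = [3, 3, 7, 7]

# The pipeline turns raw text into TF-IDF features, then fits the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X, y)

# New tweets go in as raw strings; the pipeline vectorizes them itself.
predicted = model.predict(["the service was terrible"])
```

The point is that the vectorizer only needs an iterable of strings and the classifier only needs a label per string, so no folder layout is required.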