Load classified data from CSV to Scikit-Learn for

2020-06-24 01:36发布

问题:

I'm learning Scikit-Learn to do some classifying for tweets. I have a csv with tweets on one column, and their class from 0-11 in next column. I went through this tutorial from Scikit-Learn site I think I understand how the actual classifying is done but I don't think I really understood the data format. In tutorial the material was in files in folders where folder names acted as a classification tag.

In my case I should load that data from csv file and apparently I need to construct the datastructure which is feed to vectorizer and classifier manually. How I should approach this? I think the tutorial was a bit ambiguous in this respect since the data loading was done automagically and left me in dark concerning the structure and loading of custom data.

回答1:

Normally you would use pandas.read_csv or if you don't want a pandas dependency numpy.load or even load the cvs to a list using the standard library. It would look like this:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('example.csv', header=None, sep=',', 
                 names=['tweets', 'class'])   # columns names if no header
vect = TfidfVectorizer()
X = vect.fit_transform(df['tweets']) 
y = df['class']

Once you have your X and y you can feed them to a classifier.