The following example shows how one can vectorize text and train a classifier with the scikit-learn 20 newsgroups data.
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train',
...                                       categories=categories)
>>> vectorizer = TfidfVectorizer()
>>> vectors = vectorizer.fit_transform(newsgroups_train.data)
>>> vectors.shape
(2034, 34118)
However, I have my own labeled corpus that I would like to use.
After getting a TF-IDF vector of my own data, would I train a classifier like this?
classif_nb = nltk.NaiveBayesClassifier.train(vectorizer)
To recap: How can I use my own corpus instead of the 20 newsgroups data, but in the same way as used here? How can I then use my TF-IDF-vectorized corpus to train a classifier?
Thanks!
In general, for sklearn the flow is: load your texts and their labels, fit a TfidfVectorizer on the training texts to turn them into feature vectors, and then train a classifier on those vectors and labels (see the sketch below).
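A minimal sketch of that flow, assuming your corpus is already a list of strings called texts with matching labels in labels (both hypothetical names), and using MultinomialNB as one example classifier that accepts the sparse TF-IDF matrix directly:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["first document ...", "second document ..."]   # your own corpus (placeholder)
labels = ["spam", "ham"]                                 # your own labels (placeholder)

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)   # fit the vectorizer on the training text
classif = MultinomialNB()
classif.fit(vectors, labels)                # train on the TF-IDF vectors and labels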
You did not mention your data format, but if it is a CSV file with, say, one document and its label per row, the flow could be: read the file, take the text column and the label column, and then vectorize and train exactly as above; a sketch follows.
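A hedged sketch for that CSV case; the file name my_corpus.csv and the column names 'text' and 'label' are assumptions about your data:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('my_corpus.csv')            # hypothetical file
texts = df['text'].astype(str).tolist()      # one document per row (assumed column)
labels = df['label'].tolist()                # matching labels (assumed column)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts)
classif = MultinomialNB().fit(X_train, labels)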
And once you have trained the classifier you can call predict on new data. Remember to convert the new data to the same format as the data you trained on, by passing it through the same fitted vectorizer used above, before handing it to classif.predict.
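A short sketch of that prediction step; the training data and names are placeholders, and the key point is to reuse the same fitted vectorizer and call transform (not fit_transform) on the new text:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["old document one", "old document two"]   # placeholder training data
train_labels = ["a", "b"]

vectorizer = TfidfVectorizer()
classif = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

new_docs = ["some unseen text to classify"]
new_vectors = vectorizer.transform(new_docs)   # same feature space as training
print(classif.predict(new_vectors))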
To address the questions from the comments: the whole basic process of working with a TF-IDF representation in a classification task is to fit the vectorizer on your training texts, transform them into TF-IDF vectors, train the classifier on those vectors and labels, and then transform any new text with the same fitted vectorizer before calling predict.
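One possible end-to-end sketch of that process, using a scikit-learn Pipeline so the vectorizer and classifier stay fitted together; the sample texts, labels, and the choice of MultinomialNB are illustrative assumptions, not the only option:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

texts = ["doc one ...", "doc two ...", "doc three ...", "doc four ..."]  # placeholders
labels = ["a", "b", "a", "b"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
model.fit(X_train, y_train)                    # fits vectorizer and classifier together
print(accuracy_score(y_test, model.predict(X_test)))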