How do I classify documents with SciKitLearn using

The following example shows how one can train a classifier with the Sklearn 20 newsgroups data.

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train',
...                                       categories=categories)
>>> vectorizer = TfidfVectorizer()
>>> vectors = vectorizer.fit_transform(newsgroups_train.data)
>>> vectors.shape
(2034, 34118)

However, I have my own labeled corpus that I would like to use.

After getting TF-IDF vectors for my own data, would I train a classifier like this?

classif_nb = nltk.NaiveBayesClassifier.train(vectorizer)

To recap: How can I use my own corpus instead of the 20newsgroups, but in the same way used here? How can I then use my TFIDFVectorized corpus to train a classifier?

Thanks!

Tags: python machine-learning scikit-learn

2 answers

小情绪 Triste *

#2 · 2019-03-22 08:52

In general, the flow for sklearn is:

1. Convert your string data to numeric values using a vectorizer, e.g. TfidfVectorizer or CountVectorizer.
2. Fit the vectorizer and transform your data.
3. Pass the result to the fit (train) method of the classifier of your choice.

You did not mention your data format, but if it is a CSV file with one document per row, the flow could be:

1. Read each row of text.
2. Preprocess it, e.g. remove stop words.
3. Collect the rows in a list: raw_data_list = [row1, row2, ..., rown]
4. Create the vectorizer: vectorizer = TfidfVectorizer()
5. Fit and transform: x_transformed = vectorizer.fit_transform(raw_data_list)
6. Pass x_transformed to the fit/train function of your classifier.

Once you have a trained classifier, you can call predict on new data. Remember to convert the new data to the same format as the training data, using the same fitted vectorizer (call transform, not fit_transform), before passing it to classif.predict.
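The flow above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the tiny in-memory CSV, its column names (label, text), and the choice of MultinomialNB are all made up for the example, since the question does not say what the corpus looks like. Stop-word removal is done here through the vectorizer's stop_words option rather than as a separate preprocessing step.

```python
import csv
import io

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled corpus. In practice you would read your own file,
# e.g. with open("my_corpus.csv", newline="") instead of io.StringIO.
csv_text = """label,text
space,The rocket launched into orbit around the moon
space,NASA astronauts repaired the orbiting telescope
space,The satellite reached orbit after the rocket launch
graphics,Rendering the image with a new shader and texture
graphics,The polygon mesh was rendered as a 3D image
graphics,Texture mapping improves the rendered image
"""

# Read each row and split it into the raw text and its label.
rows = list(csv.DictReader(io.StringIO(csv_text)))
raw_data_list = [row["text"] for row in rows]
labels = [row["label"] for row in rows]

# Fit the vectorizer on the training text and transform it in one call.
vectorizer = TfidfVectorizer(stop_words="english")
x_transformed = vectorizer.fit_transform(raw_data_list)

# Train the classifier on the TF-IDF matrix and the matching labels.
classif = MultinomialNB()
classif.fit(x_transformed, labels)

# New data must go through the SAME fitted vectorizer: transform only.
new_docs = ["The rocket reached orbit", "A shader rendered the texture"]
new_transformed = vectorizer.transform(new_docs)
predicted = classif.predict(new_transformed)
print(list(predicted))
```

Note that the classifier is trained on the matrix returned by the vectorizer together with the labels, not on the vectorizer object itself.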


beautiful°

#3 · 2019-03-22 09:10

To address the questions from the comments, the basic process of working with a tf-idf representation in a classification task is:

1. Fit the vectorizer to your training data and save it in a variable; let's call it tfidf.
2. Transform the training data (just the text, without labels) with data = tfidf.transform(...).
3. Fit the model (classifier) using some_classifier.fit(data, labels), where labels are in the same order as the documents in data.
4. During testing, call tfidf.transform(...) on the new data and check the predictions of your model.
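As a rough sketch of these four steps with a held-out test set, assuming LogisticRegression stands in for some_classifier and using invented example sentences (neither comes from the original answer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical training and test data; labels follow the order of the texts.
train_texts = [
    "the rocket reached orbit",
    "the telescope observed a distant planet",
    "the shader rendered the texture",
    "the polygon mesh was drawn on screen",
]
train_labels = ["space", "space", "graphics", "graphics"]
test_texts = ["a rocket near the planet", "texture on a polygon mesh"]
test_labels = ["space", "graphics"]

# Step 1: fit the vectorizer on the training text only.
tfidf = TfidfVectorizer()
tfidf.fit(train_texts)

# Step 2: transform the training text (no labels involved).
data = tfidf.transform(train_texts)

# Step 3: fit the classifier on the vectors and the labels.
some_classifier = LogisticRegression()
some_classifier.fit(data, train_labels)

# Step 4: transform the test text with the same vectorizer, then predict.
test_data = tfidf.transform(test_texts)
predictions = some_classifier.predict(test_data)
print(accuracy_score(test_labels, predictions))

# Words never seen during fit are ignored, so this row has no nonzero entries.
unseen = tfidf.transform(["zzzunknownword"])
print(unseen.nnz)
```

Because the vocabulary is fixed at fit time, the test matrix always has the same number of columns as the training matrix, which is what lets the classifier accept it.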
