Does it make sense to use both CountVectorizer and TfidfVectorizer as feature vectors?

Posted 2019-07-19 07:47

I am trying to build feature vectors from my CSV file, which contains about 1000 comments. One of my feature vectors is tf-idf, computed with scikit-learn's TfidfVectorizer. Does it make sense to also use counts as a feature vector, or is there a better feature vector I should use?

And if I do end up using both CountVectorizer and TfidfVectorizer as my features, how should I fit them both into my KMeans model (specifically the km.fit() part)? For now I am only able to fit the tf-idf feature vectors into the model.

Here is my code:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans

# tf-idf representation of the comments
vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)

# raw term counts (commented out for now)
#count_vectorizer = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
#count_vectorized = count_vectorizer.fit_transform(sentence_list)

km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)

1 Answer
再贱就再见
#2 · 2019-07-19 08:19

Essentially, what you are doing is finding a numeric representation of your text documents (feature engineering). For some problems raw counts work better, and for others the tf-idf representation is the best choice; you should really try both. The two representations are very similar and therefore carry approximately the same information, but it is possible that you will get better precision by using the full set of features (tf-idf + counts): searching in this larger feature space may bring you closer to the true model.
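To see the difference concretely, here is a minimal sketch (the three-sentence toy corpus is made up for illustration) showing that CountVectorizer produces raw integer term frequencies, while TfidfVectorizer down-weights terms that appear in many documents:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = ["the cat sat", "the cat ran", "the dog ran"]

# raw integer term counts, one row per document
counts = CountVectorizer().fit_transform(toy_corpus)
print(counts.toarray())

# same vocabulary, but common terms like "the" receive a lower weight
tfidf = TfidfVectorizer().fit_transform(toy_corpus)
print(tfidf.toarray())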

This is how you can horizontally stack your features:

import scipy.sparse

# both matrices have one row per document, so they can be stacked column-wise
X = scipy.sparse.hstack([vectorized, count_vectorized])

Then you can just do:

model.fit(X, y)  # y is ignored by unsupervised estimators such as KMeans
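Putting it together with the code from the question, a minimal end-to-end sketch (assuming sentence_list and num_clusters are defined as in the question) could look like this:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
import scipy.sparse

tfidf_vec = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
count_vec = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')

# fit both vectorizers on the same documents so the rows line up
tfidf_X = tfidf_vec.fit_transform(sentence_list)
count_X = count_vec.fit_transform(sentence_list)

# combined feature matrix: tf-idf columns followed by count columns
X = scipy.sparse.hstack([tfidf_X, count_X])

km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(X)          # clustering is unsupervised, so no y is passed
print(km.labels_)  # cluster assignment for each comment

One caveat: tf-idf rows are L2-normalized by default while the raw counts are not, so the count block can dominate the Euclidean distances KMeans uses; you may want to normalize or scale the count features before stacking.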