List the words in a vocabulary according to occurrence in a text corpus, Scikit-Learn

Posted 2019-08-31 11:03

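I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop words. For example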

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there a built-in function for this?

Answer 1:

If cv is your CountVectorizer and X is the vectorized corpus, then

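import numpy as np
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
list(zip(cv.get_feature_names_out(),
         np.asarray(X.sum(axis=0)).ravel()))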

returns a list of (term, frequency) pairs, one for each distinct term that the CountVectorizer extracted from the corpus.

(The little asarray + ravel dance is needed to work around some quirks in scipy.sparse.)

Answer 2:

There is no built-in. I have found a faster way to do it, based on Ando Saabas's answer:

from sklearn.feature_extraction.text import CountVectorizer
texts = ["Hello world", "Python makes a better world"]
vec = CountVectorizer().fit(texts)  # learn the vocabulary
bag_of_words = vec.transform(texts)  # sparse document-term matrix
sum_words = bag_of_words.sum(axis=0)  # total count per term (1 x n_terms)
# vocabulary_ maps each term to its column index in the matrix
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
sorted(words_freq, key=lambda x: x[1], reverse=True)

Output (the order of terms with equal counts may differ):

[('world', 2), ('python', 1), ('hello', 1), ('better', 1), ('makes', 1)]
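
Since the question's goal is to pick stop words, one possible next step is sketched below: take the most frequent terms as candidates and pass them to a new CountVectorizer through its `stop_words` parameter. The cut-off `top_k` and the names `candidate_stop_words` and `filtered_vec` are hypothetical and not part of the original answer; in practice the list should be reviewed by hand rather than chosen purely by rank.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello world", "Python makes a better world"]

vec = CountVectorizer().fit(texts)
# Total count per term, flattened to a 1-D array indexed by column.
counts = np.asarray(vec.transform(texts).sum(axis=0)).ravel()

# (term, count) pairs, most frequent first.
words_freq = sorted(
    ((word, int(counts[idx])) for word, idx in vec.vocabulary_.items()),
    key=lambda x: x[1], reverse=True)

top_k = 1  # hypothetical cut-off, chosen only for this toy example
candidate_stop_words = [word for word, _ in words_freq[:top_k]]

# Refit with the candidates excluded from the vocabulary.
filtered_vec = CountVectorizer(stop_words=candidate_stop_words).fit(texts)

print(candidate_stop_words)
print(sorted(filtered_vec.vocabulary_))
```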

Source: List the words in a vocabulary according to occurrence in a text corpus, Scikit-Learn
