I want to get the highest frequency terms out of tf-idf vectors in scikit-learn. From the example below, it can be done for each category, but I want it for each file inside each category.
https://github.com/scikit-learn/scikit-learn/blob/master/examples/document_classification_20newsgroups.py
if opts.print_top10:
    print "top 10 keywords per class:"
    for i, category in enumerate(categories):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print trim("%s: %s" % (
            category, " ".join(feature_names[top10])))
I want to do this for each file from the test dataset, rather than for each category. Where should I look?
Thanks.
EDIT: s/discriminative/highest frequency/g (sorry for the confusion)
You can use the result of transform along with get_feature_names to obtain the term counts for a given document.
X = vectorizer.transform(docs)
terms = np.array(vectorizer.get_feature_names())
terms_for_first_doc = zip(terms, X.toarray()[0])
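Building on that, here is a minimal sketch (assuming vectorizer is already fitted and docs is a list of strings; the top-10 cutoff is arbitrary) that prints the highest-weighted terms for every document using only the public API:

import numpy as np

X = vectorizer.transform(docs)
terms = np.array(vectorizer.get_feature_names())

for i in range(X.shape[0]):
    row = X.getrow(i).toarray().ravel()      # dense weights for document i
    top10 = np.argsort(row)[-10:][::-1]      # indices of the 10 largest weights
    print "doc %d: %s" % (i, " ".join(terms[top10]))

Note that for a document with fewer than 10 distinct terms, some of the reported entries will have zero weight.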
It seems no one knows. I am answering here for others who face the same problem, to show where to look now; it is not fully implemented yet.
This is buried deep inside CountVectorizer in sklearn.feature_extraction.text:
def transform(self, raw_documents):
    """Extract token counts out of raw text documents using the vocabulary
    fitted with fit or the one provided in the constructor.

    Parameters
    ----------
    raw_documents: iterable
        an iterable which yields either str, unicode or file objects

    Returns
    -------
    vectors: sparse matrix, [n_samples, n_features]
    """
    if not hasattr(self, 'vocabulary_') or len(self.vocabulary_) == 0:
        raise ValueError("Vocabulary wasn't fitted or is empty!")

    # raw_documents can be an iterable so we don't know its size in
    # advance

    # XXX @larsmans tried to parallelize the following loop with joblib.
    # The result was some 20% slower than the serial version.
    analyze = self.build_analyzer()
    term_counts_per_doc = [Counter(analyze(doc)) for doc in raw_documents]
    self.test_term_counts_per_doc = deepcopy(term_counts_per_doc)  # <<-- added here
    return self._term_count_dicts_to_matrix(term_counts_per_doc)
I have added self.test_term_counts_per_doc = deepcopy(term_counts_per_doc), which makes the per-document counts accessible from the vectorizer outside, like this:
import os
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer

load_files = recursive_load_files  # custom recursive loader
trainer_path = os.path.realpath(trainer_path)
tester_path = os.path.realpath(tester_path)

data_train = load_files(trainer_path, load_content=True, shuffle=False)
data_test = load_files(tester_path, load_content=True, shuffle=False)
print 'data loaded'

categories = None  # for case categories == None
print "%d documents (training set)" % len(data_train.data)
print "%d documents (testing set)" % len(data_test.data)
#print "%d categories" % len(categories)
print

# split a training set and a test set
print "Extracting features from the training dataset using a sparse vectorizer"
t0 = time()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.7,
                             stop_words='english', charset_error="ignore")
X_train = vectorizer.fit_transform(data_train.data)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape
print

print "Extracting features from the test dataset using the same vectorizer"
t0 = time()
X_test = vectorizer.transform(data_test.data)

print "Test printing terms per document"
for counter in vectorizer.test_term_counts_per_doc:
    print counter
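For what it's worth, the same per-document information can be read straight off the sparse matrix that transform already returns, without patching CountVectorizer. A hedged sketch reusing X_test and vectorizer from the script above (note these are tf-idf weights rather than raw counts):

import numpy as np

terms = np.array(vectorizer.get_feature_names())

for i in range(X_test.shape[0]):
    row = X_test.getrow(i)              # 1 x n_features CSR row
    # row.indices holds the feature ids present in this document,
    # row.data the corresponding tf-idf weights
    order = np.argsort(row.data)[::-1]  # sort the nonzero weights, descending
    print "doc %d:" % i, zip(terms[row.indices][order], row.data[order])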
Here is my fork, and I have also submitted a pull request:
https://github.com/v3ss0n/scikit-learn
Please suggest if there is a better way to do this.
Source: How can I get highest frequency terms out of tf-idf vectors, for each file in scikit-learn?