I'm currently trying to process the lingspam dataset by counting the occurrences of words across 600 files (400 emails and 200 spam emails). I've already made each word generic with the Porter Stemmer algorithm, and I'd also like my results to be normalized across each file for further processing, but I'm unsure how to accomplish this.
Resources so far:
- 8.3. collections — Container datatypes
- How to count co-occurrences with collections.Counter() in Python?
- Bag of Words model
To get the output below, I need to be able to add items that may not exist in a given file, sorted in ascending order.
printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0, 'univers', 0, 'sales', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2, 'univers', 0, 'sales', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0, 'univers', 0, 'sales', 1)]
Which I then intend to convert into vectors using numpy:
[0,0,0]
[2,0,0]
[0,0,0]
Instead of what I currently get:
printing from ./../lingspam_results/spmsgb165.txt.out
[]
printing from ./../lingspam_results/spmsgb166.txt.out
[('univers', 2)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('sale', 1)]
How can I normalize my results from the Counter module into ascending order (whilst also adding items to my counter results that may not exist in a file but do exist in my search_list)? What I've tried so far is below; it simply reads from each text file and builds a list based on search_list.
import numpy as np, os
from collections import Counter

def parse_bag(directory, search_list):
    # walk the results directory and count words in every file
    for (dirpath, dirnames, filenames) in os.walk(directory):
        for f in filenames:
            path = os.path.join(dirpath, f)
            count_words(path, search_list)

def count_words(filename, search_list):
    # split the file into words and keep only those in search_list
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    wordfreq = Counter(filteredwords).most_common(5)
    print "printing from " + filename
    print wordfreq

search_list = ['sale', 'univers', 'money']
parse_bag("./../lingspam_results", search_list)
Thanks
From your question, it sounds like your requirement is a consistent ordering of the same words across all files, together with their counts. This should do it for you:
def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0  # ensure exists
    wordfreq = sorted(counter.items())
    print "printing from " + filename
    print wordfreq

search_list = ['sale', 'univers', 'money']
Sample output:
printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0), ('sale', 1), ('univers', 0)]
I don't think you want to use most_common at all, since you specifically don't want each file's contents to affect the ordering or the length of the list.
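A side note, not from the original answer: Counter returns 0 for keys it has never seen (without inserting them), so if you only need counts in a fixed order you can index the counter directly and skip the += 0 loop. A minimal sketch, with a hypothetical helper name:

from collections import Counter

def counts_in_order(filename, search_list):
    # Counter[key] is 0 for unseen keys, so no pre-seeding loop is needed
    counter = Counter(open(filename, 'r').read().split())
    return [(w, counter[w]) for w in sorted(search_list)]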
The call to Counter(filteredwords), as you use in your example, counts all the words just as you intend. What it doesn't do on its own is give you control over how the most common entries are selected and ordered: for that you can re-process all the items in the counter into a sequence of (frequency, word) tuples and sort them, as follows:
def most_common(counter, n=5):
    freq = sorted(((value, item) for item, value in counter.viewitems()), reverse=True)
    return [item[1] for item in freq[:n]]
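For illustration (made-up data, not from the thread), the helper returns the n most frequent words, highest count first:

c = Counter(['money', 'money', 'sale'])
print most_common(c)  # ['money', 'sale']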
A combination of both jsbueno's and Mu Mind's answers:
def count_words_SO(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0  # ensure exists
    wordfreq = number_parse(counter)
    print "printing from " + filename
    print wordfreq

def number_parse(counter, n=5):
    # sort (count, word) pairs by count, descending, and keep just the counts
    freq = sorted(((value, item) for item, value in counter.viewitems()), reverse=True)
    return [item[0] for item in freq[:n]]
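One caveat (my observation, not from the thread): number_parse sorts by count, so the position of each number depends on the counts themselves, and index 0 is not guaranteed to be the same word in every file. If the vectors need a stable word-to-index mapping, a variant that fixes positions to the sorted search_list (hypothetical helper name):

def number_parse_fixed(counter, search_list):
    # one count per search word, always in the same alphabetical slot
    return [counter[w] for w in sorted(search_list)]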
Works great; just a little more work and I'll have it ready for the Neural Network. Thanks all :)
printing from ./../lingspam_results/spmsgb19.txt.out
[0, 0, 0]
printing from ./../lingspam_results/spmsgb2.txt.out
[4, 0, 0]
printing from ./../lingspam_results/spmsgb20.txt.out
[10, 0, 0]
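To finish the numpy conversion mentioned in the question, the per-file count lists can then be stacked into a matrix; a minimal sketch (the rows shown are just the sample counts above):

import numpy as np

# one row per file, one column per search word
vectors = np.array([[0, 0, 0],
                    [4, 0, 0],
                    [10, 0, 0]])
print vectors.shape  # (3, 3)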