To count the word frequency in multiple documents

Posted 2019-08-05 07:44


I have the paths of multiple text files stored in a dictionary 'd':

'd:/individual-articles/9.txt', 'd:/individual-articles/11.txt', 'd:/individual-articles/12.txt',...

and so on...

Now, I need to read each file in the dictionary and keep a count of how often each word occurs across all the files in the dictionary.

My output should be of the form:

the-500

a-78

in-56

and so on.

where 500 is the number of times the word "the" occurs across all the files in the dictionary, and so on.

I need to do this for all the words.

I am a Python newbie, please help!

My code below doesn't work; it shows no output! There must be a mistake in my logic, please help me correct it!

import collections
import itertools
import os
from glob import glob
from collections import Counter




folderpaths='d:/individual-articles'
counter=Counter()


filepaths = glob(os.path.join(folderpaths,'*.txt'))




folderpath='d:/individual-articles/'
# i am creating my dictionary here, can be ignored
d = collections.defaultdict(list)
with open('topics.txt') as f:
    for line in f:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            if key=='earn':
                d[key].append(folderpath+value+".txt")

for key, value in d.items():
    print(value)


word_count_dict={}

for file in d.values():
    with open(file,"r") as f:
        words = re.findall(r'\w+', f.read().lower())
        counter = counter + Counter(words)
        for word in words:
            word_count_dict[word].append(counter)              


for word, counts in word_count_dict.values():
    print(word, counts)

2 Answers
Deceive 欺骗
Reply #2 · 2019-08-05 07:55

Your code should give you an error in this line:

word_count_dict[word][file]+= 1              

Because your word_count_dict is empty, word_count_dict[word][file] should raise a KeyError: word_count_dict[word] doesn't exist, so you can't index it with [file].
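A minimal sketch of that failure and one possible way around it, assuming per-file counts for each word are really what is wanted. The file names and word lists here are made up for illustration, and defaultdict(Counter) is just one option, not something the original code used:

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (file name, words found in that file)
samples = [
    ("9.txt", ["the", "a", "the"]),
    ("11.txt", ["the", "in"]),
]

# A plain dict raises KeyError on the first nested access,
# because plain["the"] does not exist yet
plain = {}
try:
    plain["the"]["9.txt"] += 1
    key_error_raised = False
except KeyError:
    key_error_raised = True

# defaultdict(Counter) creates the inner Counter the first time a word is seen
word_count_dict = defaultdict(Counter)
for file, words in samples:
    for word in words:
        word_count_dict[word][file] += 1

print(key_error_raised)
print(dict(word_count_dict["the"]))
```

If only the overall totals are needed, the nested structure is unnecessary and a single Counter is enough.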

And I found another error:

while file in d.items():

This would make file a tuple. But then you do f = open(file,"r"), so you assume file is a string. This would also raise an error.

This means that none of these lines are ever executed. That in turn means that either while file in d.items(): is empty or for file in filepaths: is empty.

And to be honest, I don't understand why you have both of them, or what you are trying to achieve there. You have generated a list of filenames to parse, so you should just iterate over them. I also don't know why d is a dict. All you need is a list of all the files. You don't need to keep track of which key in the topics list each file came from, do you?
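A small sketch of the point about tuples and lists, assuming d maps a topic name to a list of paths as in the question. The contents of d below are a hypothetical stand-in, not the real data:

```python
from itertools import chain

# Hypothetical stand-in for the asker's dictionary: topic -> list of file paths
d = {
    "earn": ["d:/individual-articles/9.txt", "d:/individual-articles/11.txt"],
}

# d.items() yields (key, value) tuples, which open() cannot accept as a path
first_item = next(iter(d.items()))

# Each value is itself a list, so flatten all values into one list of paths
all_files = list(chain.from_iterable(d.values()))

for path in all_files:
    print(path)
```

Iterating over all_files gives plain strings, so each one can be passed straight to open().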

别忘想泡老子
Reply #3 · 2019-08-05 08:17

Inspired by the Counter collection that you use:

import os
import re
from glob import glob
from collections import Counter

folderpaths = 'd:/individual-articles'
counter = Counter()

filepaths = glob(os.path.join(folderpaths, '*.txt'))
for file in filepaths:
    with open(file) as f:
        words = re.findall(r'\w+', f.read().lower())
        # update() adds the counts in place instead of building a new Counter each time
        counter.update(words)
print(counter)
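To get the word-count layout asked for in the question (the-500, a-78, ...), the totals can be printed with most_common(). This is only a sketch: count_words is a hypothetical helper name, and the demo writes two made-up files into a temporary folder so it runs without the real articles:

```python
import os
import re
import tempfile
from collections import Counter
from glob import glob


def count_words(filepaths):
    """Return one Counter with word totals across all the given files."""
    counter = Counter()
    for path in filepaths:
        with open(path, encoding="utf-8") as f:
            counter.update(re.findall(r"\w+", f.read().lower()))
    return counter


# Demo on a throwaway folder holding two made-up files
with tempfile.TemporaryDirectory() as folder:
    samples = {"9.txt": "The cat and the dog", "11.txt": "A dog in the house"}
    for name, text in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(text)
    totals = count_words(glob(os.path.join(folder, "*.txt")))

# most_common() returns (word, count) pairs sorted by descending count
lines = [f"{word}-{count}" for word, count in totals.most_common()]
print("\n".join(lines))
```

For the real data, pass the glob result for 'd:/individual-articles' to count_words instead of the temporary folder.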