First time posting on Stack Overflow; until now, previous questions have always been enough to solve my problems. The main problem I have is the logic, so even a pseudocode answer would be great.
I'm using Python to read in data from each line of a text file, in the format:
This is a tweet captured from the twitter api #hashtag http://url.com/site
Using NLTK, I can tokenize by line and then use reader.sents() to iterate through the lines:
reader = TaggedCorpusReader(filecorpus, r'.*\.txt', sent_tokenizer=LineTokenizer())
reader.sents()[:10]
But I would like to count the frequency of certain 'hot words' (stored in an array or similar) per line, then write the counts back to a text file. If I used reader.words(), I could count the frequency of 'hot words' in the entire text, but I'm looking for the count per line (or 'sentence' in this case).
Ideally, something like:
hotwords = (['tweet'], ['twitter'])
for each line
    tokenize into words.
    for each word in line
        if word is equal to hotword[1], hotword1 count ++
        if word is equal to hotword[2], hotword2 count ++
    at end of line, for each hotword[index]
        filewrite count,
Also, I'm not too worried about URLs getting broken up (WordPunctTokenizer would split the punctuation off into separate tokens; that's not an issue).
Any useful pointers (including pseudo or links to other similar code) would be great.
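To make the pseudocode above concrete, here is a minimal sketch that skips the corpus reader and works on a plain list of lines. Some assumptions that are not part of the setup above: the count_hotwords helper is a made-up name, the regex is meant to mimic the pattern WordPunctTokenizer is built on (so NLTK isn't needed to try it), and tokens are lowercased before matching, which may or may not be wanted.

```python
import re
from collections import Counter

# Assumption: this mirrors the word/punctuation split that NLTK's
# WordPunctTokenizer performs, so the sketch runs without NLTK.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def count_hotwords(line, hotwords):
    """Hypothetical helper: return one count per hot word for a single line."""
    # Lowercasing is an assumption; drop it for case-sensitive matching.
    tokens = Counter(tok.lower() for tok in TOKEN_RE.findall(line))
    return [tokens[word] for word in hotwords]

hotwords = ['tweet', 'twitter']
lines = [
    "This is a tweet captured from the twitter api #hashtag http://url.com/site",
    "no matches on this line",
]

# One list of counts per line, in the same order as hotwords.
counts_per_line = [count_hotwords(line, hotwords) for line in lines]
```

Keeping the counts in the same order as hotwords means each line can later be written out as one row, with a fixed column per hot word.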
---- EDIT ------------------
Ended up doing something like this:
import nltk
from nltk.corpus.reader import TaggedCorpusReader
from nltk.tokenize import LineTokenizer
#from nltk.tokenize import WordPunctTokenizer
from collections import defaultdict
# Create reader and generate corpus from all files in dir matching the pattern (.csv here).
filecorpus = 'Twitter/FINAL_RESULTS/tweetcorpus'
filereader = TaggedCorpusReader(filecorpus, r'.*\.csv', sent_tokenizer=LineTokenizer())
print "Reader accessible."
print filereader.fileids()
#define hotwords
hotwords = ('cool','foo','bar')
tweetdict = []
for line in filereader.sents():
    wordcounts = defaultdict(int)
    for word in line:
        if word in hotwords:
            wordcounts[word] += 1
    tweetdict.append(wordcounts)
Output is:
print tweetdict
[defaultdict(<type 'int'>, {}),
defaultdict(<type 'int'>, {'foo': 2, 'bar': 1, 'cool': 2}),
defaultdict(<type 'int'>, {'cool': 1})]
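
The original goal also included writing the per-line counts back to a text file. Something along these lines might work for that step. It is only a sketch: the write_counts helper, the comma-separated layout, and the temp-dir output path are made up for illustration, and a plain list of dicts stands in for the tweetdict list so the snippet runs on its own.

```python
import os
import tempfile

hotwords = ('cool', 'foo', 'bar')
# Stand-in for the tweetdict list (assumption: one dict of counts per line).
tweetdict = [{}, {'foo': 2, 'bar': 1, 'cool': 2}, {'cool': 1}]

def write_counts(path, rows, words):
    """Hypothetical helper: header row, then one comma-separated row per line."""
    with open(path, 'w') as handle:
        handle.write(','.join(words) + '\n')
        for counts in rows:
            # .get() reports 0 for hot words that never appeared on the line
            handle.write(','.join(str(counts.get(word, 0)) for word in words) + '\n')

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.mkdtemp(), 'hotword_counts.csv')
write_counts(out_path, tweetdict, hotwords)
```

Using .get(word, 0) rather than indexing avoids adding empty keys to a defaultdict while still writing a zero for every hot word that is missing from a line.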