I've written a piece of code that essentially counts word frequencies and inserts them into an ARFF file for use with weka. I'd like to alter it so that it can count bi-gram frequencies, i.e. pairs of words instead of single words although my attempts have proved unsuccessful at best.
I realise there's alot to look at but any help on this is greatly appreciated. Here's my code:
import re
import nltk
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = [punctuation.sub("", word) for word in word_list]
word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]
# create dictionary of word:frequency pairs
freq_dic = {}
for word in word_list2:
# form dictionary
try:
freq_dic[word] += 1
except:
freq_dic[word] = 1
print '-'*30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result as top 10 most frequent words
freq_list4 =[]
freq_list4=freq_list3[:10]
words = []
for item in freq_list4:
a = str(item[1])
a = a.lower()
words.append(a)
f = open(filename)
newlist = []
for line in f:
line = punctuation.sub("", line)
line = line.lower()
newlist.append(line)
f2 = open('Lines.txt','w')
newlist2= []
for line in newlist:
line = line.split()
newlist2.append(line)
f2.write(str(line))
f2.write("\n")
print newlist2
# ARFF Creation
arff = open('output.arff','w')
arff.write('@RELATION wordfrequency\n\n')
for word in words:
arff.write('@ATTRIBUTE ')
arff.write(str(word))
arff.write(' numeric\n')
arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n')
arff.write('@DATA\n')
# Counting word frequencies for each verse
for line in newlist2:
word_occurrences = str("")
for word in words:
matches = int(0)
for item in line:
if str(item) == str(word):
matches = matches + int(1)
else:
continue
word_occurrences = word_occurrences + str(matches) + ","
word_occurrences = word_occurrences + "endofworld"
arff.write(word_occurrences)
arff.write("\n")
print words
Life is much more easier if you start using NLTK's FreqDist function to do the counting. Also NLTK has bigram feature. Examples for both of them are in the following page.
http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html
I've rewritten the first bit for you, because it's icky. Points to note:
collections.Counter
is great!OK, code:
This should get you started:
Note that the first bigram is
(None, w1)
wherew1
is the first word, so you have a special bigram that marks start-of-text. If you also want an end-of-text bigram, addyield (wprev, None)
after the loop.Generalized to n-grams with optional padding, also uses
defaultdict(int)
for frequencies, to work in 2.6:Output: