As you can see below, when I open test.txt and put the words into a set, the difference of the set with the common_words set is returned. However, it is only removing a single instance of the words in the common_words set rather than all occurrences of them. How can I achieve this? I want to remove ALL instances of items in common_words from title_words
from string import punctuation
from operator import itemgetter

N = 10
words = {}

linestring = open('test.txt', 'r').read()

# set A, want to remove these from set B
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))

title = linestring

# set B, want to remove ALL words in set A from this set and store in keywords
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)

words_gen = (word.strip(punctuation).lower() for line in keywords
             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
Strip the punctuation out before you make it a set. As posted, your code tries to call the strip method of title_words, which is a set (only str has this method). You could do something like this instead:
If title_words is a set, then there is only one occurrence of any one word. So you only need to remove one occurrence. Have I misunderstood your question?

I'm still a bit confused by this question, but I notice that one problem might be that when you pass your initial data through set, the punctuation hasn't been stripped yet. So there may be multiple punctuated versions of a word slipping through the .difference() operation. Try this:
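Something along these lines (the idea: strip the punctuation from every word before the set exists, so "fox." and "fox" collapse into a single entry):

stripped = (word.strip(punctuation) for word in title.lower().split())
title_words = set(stripped)
keywords = title_words.difference(common_words)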
Also, your words_gen generator is written in a slightly confusing way. Why line in keywords -- what's the line? And why are you calling split() again? keywords ought to be a set of straight words, right?

I agree with senderle. Try this code:
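A sketch of the whole thing (strip and lowercase while counting, and skip common words at that point, so every occurrence is excluded):

from string import punctuation
from operator import itemgetter

N = 10
words = {}
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))

linestring = open('test.txt', 'r').read()

for word in linestring.lower().split():
    word = word.strip(punctuation)            # "fox." and "fox" both become "fox"
    if word and word not in common_words:     # drops EVERY occurrence, not just one
        words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)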
That should do it. Hope this helps.
You just want the difference() method for this, but it looks like your example is buggy: title_words is a set, and doesn't have the strip() method. Try this instead:
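For example (a sketch; the strip moves inside a generator over the individual words, where it belongs):

title_words = set(word.strip(punctuation) for word in title.lower().split())
keywords = title_words.difference(common_words)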
You've succeeded in finding the top N most uniquely punctuated words in your input file.
Run this input file through your original code:
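For example, a small test.txt along these lines (any text with several punctuated variants of the same word will do):

The quick brown fox jumps over the lazy dog.
Did you see the fox? No, but I heard the fox!
The quick, brown fox.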
And you'll get the following output:
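Something like this -- the exact order among the one-count words depends on dictionary ordering:

fox: 4
quick: 2
brown: 1
jumps: 1
over: 1
lazy: 1
dog: 1
did: 1
you: 1
see: 1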
Notice that fox appears in 4 variations: "fox", "fox?", "fox!", and "fox.". The word brown appears only one way. And quick appears only with and without a comma (2 variations).

What happens when we add fox to the common_words set? Only the variation that has no trailing punctuation is removed, and we're left with the three punctuation-adorned variants, giving this output:
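Again, ties in arbitrary order:

fox: 3
quick: 2
brown: 1
jumps: 1
over: 1
lazy: 1
dog: 1
did: 1
you: 1
see: 1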
For a more realistic example, run MLK's I Have a Dream speech through your method. Dr. King says "I Have a Dream" eight times in that speech, yet dream doesn't show up at all on the list. Do a search for justice and you'll find four (4) punctuated flavors of it.

So what went wrong? It looks like this method has been through a lot of rework, considering that the names of the variables don't seem to match their purpose. So let's go through it (moving some code around a bit, my apologies).
Open the file and slurp the whole thing into linestring, good so far except for the variable name:
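linestring = open('test.txt', 'r').read()
title = linestring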
Is this a line or a title? Both? In any event, we now lowercase the whole file and split it up by whitespace:
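title_words = set(title.lower().split())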
Using my test file, this means title_words now contains "fox?", "fox!", "fox", and "fox.".
Now the attempt to remove the common words. Let's assume our common_words contains fox. This next line removes "fox" but leaves "fox?", "fox!", and "fox." behind:
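keywords = title_words.difference(common_words)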
The next line really looks like legacy code to me, as if it was meant to be something like for line in linestring.split('\n') for word in line.split(). In its current form, keywords is just a collection of single words, so line is a word without spaces and for word in line.split() has no effect. We just iterate over every word, remove punctuation, and make it lowercase:
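words_gen = (word.strip(punctuation).lower() for line in keywords
             for word in line.split())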
words_gen now contains 3 copies of fox: "fox", "fox", "fox". We've removed the one un-punctuated version.

The frequency analysis is pretty spot-on: it builds a histogram of the words coming out of the words_gen generator, which ultimately gives us the N most uniquely punctuated words! In this example, fox=3:
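for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]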
So there's the what-went-wrong. Others have posted clear solutions for word frequency analysis, but I'm in a bit of a performance frame of mind and came up with my own variant. First, split the text into words using a regular expression:
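Something like this (a sketch; the exact pattern is a judgment call -- this one keeps letters and apostrophes and treats everything else, punctuation included, as a separator):

import re

# lowercase the whole text once, then pull out runs of letters/apostrophes;
# punctuation and whitespace are discarded in the same pass
words_list = re.findall(r"[a-z']+", linestring.lower())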
Now based on Python Performance Tips for Initializing Dictionary Entries (and my own measured performance), find the top N most frequent words:
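A sketch of that step, using the try/except "ask forgiveness" idiom that page discusses for initializing dictionary entries:

from operator import itemgetter

N = 10
counts = {}
for word in words_list:
    if word not in common_words:
        try:
            counts[word] += 1        # common case: the key already exists
        except KeyError:
            counts[word] = 1         # first time we've seen this word

top_words = sorted(counts.iteritems(), key=itemgetter(1), reverse=True)[:N]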
To close, I ran Dr. King's speech through it with all articles and pronouns removed and, for kicks, took my own performance measurements of the two approaches.
Cheers, E
Not ideal, but works as a word frequency counter (which is what this appears to be aiming at):
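A sketch in that spirit (a defaultdict holds the counts; the file name and stop list are carried over from the question):

from collections import defaultdict
from string import punctuation

N = 10
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))

counts = defaultdict(int)
for word in open('test.txt').read().lower().split():
    word = word.strip(punctuation)       # drop leading/trailing punctuation
    if word and word not in common_words:
        counts[word] += 1                # count every remaining occurrence

for word, frequency in sorted(counts.items(),
                              key=lambda pair: pair[1], reverse=True)[:N]:
    print "%s: %d" % (word, frequency)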