I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from the first part -- category creation based upon directory names. Some other questions on here have used filenames (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into.
import string
import nltk
from nltk.corpus import movie_reviews
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
reviews.categories()
['pos', 'neg']
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
all_words = nltk.FreqDist(
    w.lower()
    for w in movie_reviews.words()
    if w.lower() not in nltk.corpus.stopwords.words('english') and w.lower() not in string.punctuation)
word_features = all_words.keys()[:100]
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
print document_features(movie_reviews.words('pos/11.txt'))
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
This returns:
  File "test.py", line 38, in <module>
    for w in movie_reviews.words()
  File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 184, in words
    self, self._resolve(fileids, categories))
  File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 91, in words
    in self.abspaths(fileids, True, True)])
  File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/util.py", line 421, in concat
    raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!
---------UPDATE-------------
Thanks alvas for your detailed answer! I have two questions, however.
- Is it possible to grab the category from the filename as I was attempting to do? I was hoping to do it in the same vein as the review_pos.txt method, only grabbing the pos from the folder name rather than the file name.
- I ran your code and am experiencing a syntax error on
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
with the caret under the first for. I'm a beginner Python user and I'm not familiar enough with that bit of syntax to try to troubleshoot it.
----UPDATE 2----
Error is
  File "review.py", line 17
    for i in word_features}, tag)
      ^
SyntaxError: invalid syntax
Yes, the tutorial in chapter 6 aims to give students basic knowledge, and from there they should build on it by exploring what is and isn't available in NLTK. So let's go through the problems one at a time.
Firstly, the way to get 'pos' / 'neg' documents through the directory is most probably the right thing to do, since the corpus was organized that way.
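To read the category from the directory name, the file pattern and `cat_pattern` both have to match file ids such as `pos/cv001_18431.txt`. Below is a minimal sketch using only `re`, assuming NLTK tests both patterns against the path relative to the corpus root with `re.match` (worth verifying against your installed version). `FILE_PATTERN`, `CAT_PATTERN`, `category_of` and the sample ids are illustrative names and values, not taken from the original post:

```python
import re
from collections import defaultdict

# Illustrative file ids, relative to the corpus root (assumed layout: <category>/<name>.txt).
fileids = ['neg/cv000_29416.txt', 'pos/cv000_29590.txt', 'pos/cv001_18431.txt']

# Patterns from the question: neither matches 'pos/cv001_18431.txt' from the start.
OLD_FILE_PATTERN = r'(\w+)/*.txt'
OLD_CAT_PATTERN = r'/(\w+)/.txt'

# Candidate replacements (assumptions for this sketch).
FILE_PATTERN = r'\w+/.*\.txt'
CAT_PATTERN = r'(\w+)/.*'


def category_of(fileid, cat_pattern=CAT_PATTERN):
    """Return the directory part of a file id, or None if the pattern fails."""
    match = re.match(cat_pattern, fileid)
    return match.group(1) if match else None


# Group the file ids by their directory name.
by_category = defaultdict(list)
for fileid in fileids:
    if re.match(FILE_PATTERN + '$', fileid):
        by_category[category_of(fileid)].append(fileid)

print(sorted(by_category))  # ['neg', 'pos']
print(category_of('pos/cv001_18431.txt', OLD_CAT_PATTERN))  # None: leading '/' never matches
```

With NLTK, patterns like these would be passed as `CategorizedPlaintextCorpusReader(root, FILE_PATTERN, cat_pattern=CAT_PATTERN)`, so any new directory dropped under the root becomes a category.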
Alternatively, I like a list of tuples where the first element is the list of words in the .txt file and the second is the category, removing the stopwords and punctuation along the way:
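A minimal sketch of that structure, assuming file ids shaped like `pos/cv001_18431.txt`. The helper `build_documents` and the tiny stand-in corpus are hypothetical, so the block runs without NLTK data. With NLTK you would pass `movie_reviews.fileids()`, `movie_reviews.words` and `stopwords.words('english')` instead:

```python
import string


def build_documents(fileids, words_for, stop):
    """Return [(tokens, category), ...] with stopwords and punctuation removed.

    fileids   -- ids shaped like 'pos/cv001_18431.txt'
    words_for -- callable mapping a file id to its tokens (e.g. movie_reviews.words)
    stop      -- collection of lowercase stopwords
    """
    stop = set(stop)  # build the set once instead of rescanning a list per token
    documents = []
    for fileid in fileids:
        tokens = [w for w in words_for(fileid)
                  if w.lower() not in stop and w.lower() not in string.punctuation]
        # The category is the directory part of the file id.
        documents.append((tokens, fileid.split('/')[0]))
    return documents


# Tiny stand-in corpus (hypothetical contents).
sample = {
    'pos/a.txt': ['A', 'great', 'film', '!'],
    'neg/b.txt': ['the', 'plot', 'is', 'dull', '.'],
}
documents = build_documents(sorted(sample), sample.__getitem__, ['a', 'the', 'is'])
print(documents)
```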
Next is the error at FreqDist(for w in movie_reviews.words() ...). There is nothing wrong with your code, just that you should try to use namespaces, i.e. import the names you use explicitly (see http://en.wikipedia.org/wiki/Namespace#Use_in_common_languages). The following code: [outputs]:
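A minimal sketch of such a frequency count with explicitly imported names. It uses `collections.Counter` as a stand-in so it runs without the corpus installed; in NLTK 3, `FreqDist` is a `Counter` subclass and could take its place. The function name `word_frequencies` and the sample tokens are illustrative. Note that the question's version calls `stopwords.words('english')` once per token, whereas building the stopword set a single time avoids that repeated work:

```python
import string
from collections import Counter


def word_frequencies(words, stop):
    """Count lowercase tokens, skipping stopwords and punctuation."""
    stop = set(stop)  # built once; membership tests are then constant time
    return Counter(w.lower() for w in words
                   if w.lower() not in stop and w.lower() not in string.punctuation)


# Illustrative tokens standing in for movie_reviews.words().
words = ['The', 'film', 'is', 'a', 'good', 'film', '.']
all_words = word_frequencies(words, ['the', 'is', 'a'])
print(all_words.most_common(2))  # [('film', 2), ('good', 1)]
```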
Since the above code prints the FreqDist correctly, the error suggests that the corpus files are missing from your nltk_data/ directory. The fact that you have fic/11.txt suggests that you're using an older version of NLTK or of the NLTK corpora. Normally the fileids in movie_reviews start with either pos or neg, then a slash, then the filename and finally .txt, e.g. pos/cv001_18431.txt.
So I think you should redownload the files with:
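One way to do this from Python is `nltk.download` with the corpus identifier. The wrapper name `redownload_movie_reviews` is hypothetical, and the sketch assumes network access and a writable default nltk_data location:

```python
def redownload_movie_reviews():
    """Fetch the movie_reviews and stopwords corpora into the default nltk_data location."""
    import nltk  # imported here so that defining the helper has no side effects

    ok_reviews = nltk.download('movie_reviews')
    ok_stopwords = nltk.download('stopwords')
    return ok_reviews and ok_stopwords
```

Call `redownload_movie_reviews()` once. Calling `nltk.download()` with no arguments typically opens the interactive downloader instead, which is where the corpora tab can be checked.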
Then make sure that the movie review corpus is properly downloaded under the corpora tab:
Back to the code: looping through all the words in the movie review corpus seems redundant if you already have all the words filtered in your documents, so I would rather do this to extract all the features:
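A minimal sketch of reusing the filtered documents to pick the feature words, again with `collections.Counter` standing in for `FreqDist`. The helper `top_words` and the sample data are illustrative. It uses `most_common(n)` rather than `keys()[:100]`, because `keys()` cannot be sliced on Python 3 and is not ordered by frequency in NLTK 3:

```python
from collections import Counter
from itertools import chain


def top_words(documents, n=100):
    """Return the n most frequent tokens across [(tokens, category), ...]."""
    counts = Counter(chain.from_iterable(tokens for tokens, _ in documents))
    return [word for word, _ in counts.most_common(n)]


# Illustrative documents in the (tokens, category) shape.
documents = [(['great', 'film', 'great'], 'pos'), (['dull', 'film'], 'neg')]
word_features = top_words(documents, n=2)
print(word_features)
```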
Next, splitting the train/test sets by features is okay, but I think it's better to split by documents, so instead of this:
I would recommend this instead:
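A minimal sketch of splitting by documents and then building the feature dictionaries. The traceback in the question points at a Python 2.6 installation, and dict-comprehension syntax (`{k: v for ...}`) only exists from Python 2.7 onwards, which would explain the SyntaxError; `dict(...)` over a generator builds the same mapping on older versions. The helper names and the `train_fraction` parameter are assumptions for illustration:

```python
def featurize(tokens, word_features):
    """Map each feature word to whether it occurs in the document."""
    present = set(tokens)
    # dict(...) over a generator also works on Python 2.6, which lacks
    # dict-comprehension syntax.
    return dict((word, word in present) for word in word_features)


def split_and_featurize(documents, word_features, train_fraction=0.9):
    """Split [(tokens, tag), ...] by document, then turn tokens into features."""
    numtrain = int(len(documents) * train_fraction)
    train_set = [(featurize(tokens, word_features), tag)
                 for tokens, tag in documents[:numtrain]]
    test_set = [(featurize(tokens, word_features), tag)
                for tokens, tag in documents[numtrain:]]
    return train_set, test_set


# Illustrative data; real input would be the filtered movie review documents.
documents = [(['great', 'film'], 'pos'), (['dull', 'film'], 'neg'),
             (['great', 'cast'], 'pos'), (['dull', 'plot'], 'neg')]
train_set, test_set = split_and_featurize(documents, ['great', 'dull'], train_fraction=0.75)
print(len(train_set), len(test_set))
print(train_set[0])
```

If the documents are ordered by category, shuffling them first (for example with `random.shuffle`) keeps both classes present in each split.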
Then feed the data into the classifier and voila! So here's the code without the comments and walkthrough:
[out]: