I am using the following code, which I got from "Classification using movie review corpus in NLTK/Python":
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, _ in word_features.most_common(100)]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
output:
0.655
Most Informative Features
bad = True neg : pos = 2.0 : 1.0
script = True neg : pos = 1.5 : 1.0
world = True pos : neg = 1.5 : 1.0
nothing = True neg : pos = 1.5 : 1.0
bad = False pos : neg = 1.5 : 1.0
I want to use my own folder instead of movie_reviews in NLTK, and put my own files inside it.
If you have your data in exactly the same structure as the movie_reviews corpus in NLTK, there are two ways to "hack" your way through:
1. Put your corpus directory where your nltk_data is saved
First, check where your nltk_data directory is saved. Then move your corpus directory into nltk_data/corpora. Finally, in your Python code, load the corpus by its directory name through NLTK's lazy corpus loader.
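The three steps above can be sketched roughly as follows. This is a minimal sketch, not the exact commands from the original answer: `install_corpus` and `load_installed_corpus` are hypothetical helper names, the corpus is copied rather than moved, and the pos/neg folder names and the `.txt` extension are assumptions carried over from movie_reviews.

```python
import shutil
from pathlib import Path


def install_corpus(src_dir, nltk_data_dir):
    """Copy src_dir to <nltk_data_dir>/corpora/<name> and return the new path.

    Hypothetical helper: nltk_data_dir is whichever nltk_data directory
    you can write to (nltk.data.path lists the ones NLTK searches).
    """
    dest = Path(nltk_data_dir) / "corpora" / Path(src_dir).name
    # dirs_exist_ok needs Python 3.8+; it lets you re-run after adding files.
    shutil.copytree(src_dir, dest, dirs_exist_ok=True)
    return dest


def load_installed_corpus(name):
    """Load corpora/<name> from nltk_data as a categorized plaintext corpus."""
    # Imported inside the function so install_corpus works without NLTK.
    from nltk.corpus import LazyCorpusLoader, CategorizedPlaintextCorpusReader

    return LazyCorpusLoader(
        name,
        CategorizedPlaintextCorpusReader,
        r'(?!\.).*\.txt',              # every non-hidden .txt file
        cat_pattern=r'(neg|pos)/.*',   # category = top-level folder (assumed pos/neg)
        encoding='ascii',
    )
```

After `install_corpus('my_movie_reviews', some_nltk_data_dir)`, something like `mr = load_installed_corpus('my_movie_reviews')` should be usable in place of `movie_reviews` in the question's code, provided `some_nltk_data_dir` is one of the entries in `nltk.data.path`.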
(For more details, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/util.py#L21 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L144)
2. Create your own CategorizedPlaintextCorpusReader
If you have no access to the nltk_data directory and you want to use your own corpus, point a CategorizedPlaintextCorpusReader directly at your corpus directory.
Similar questions have been asked in "Creating a custom categorized corpus in NLTK and Python" and "Using my own corpus for category classification in Python NLTK".
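For the second option, a minimal sketch might look like this. `make_demo_corpus` and `load_corpus` are hypothetical helper names, the two sample reviews are made up for illustration, and the pos/neg layout is assumed to mirror movie_reviews.

```python
from pathlib import Path


def make_demo_corpus(root):
    """Create <root>/pos/*.txt and <root>/neg/*.txt with one tiny review each.

    Illustrative data only; replace with your own files in the same layout.
    """
    samples = {
        "pos/cv000.txt": "a wonderful , moving film .",
        "neg/cv000.txt": "a dull script and bad acting .",
    }
    for rel, text in samples.items():
        path = Path(root) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="ascii")
    return sorted(samples)


def load_corpus(root):
    """Read any directory with pos/ and neg/ subfolders as a categorized corpus."""
    # Imported inside the function so make_demo_corpus works without NLTK.
    from nltk.corpus import CategorizedPlaintextCorpusReader

    return CategorizedPlaintextCorpusReader(
        str(root),
        r'(?!\.).*\.txt',              # every non-hidden .txt file
        cat_pattern=r'(neg|pos)/.*',   # category = top-level folder name
        encoding='ascii',
    )
```

With `mr = load_corpus('/path/to/my_movie_reviews')` (a placeholder path), calls such as `mr.categories()`, `mr.fileids(categories='pos')` and `mr.words(fileid)` should behave as they do for `movie_reviews`.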
Here's the full code that will work:
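A minimal Python 3 sketch of the whole pipeline, assuming `reader` is a CategorizedPlaintextCorpusReader over your pos/neg folders and `stop` is a set of stop words such as `set(stopwords.words('english'))`. The function names are hypothetical, and the shuffle before the 90/10 split is an addition that is not in the question's code (without it, the held-out files may all come from one category).

```python
import random
import string
from collections import Counter


def build_documents(reader, stop):
    """Return (tokens, label) pairs; the label is the top-level folder of each file ID."""
    docs = []
    for fid in reader.fileids():
        tokens = [w for w in reader.words(fid)
                  if w.lower() not in stop and w.lower() not in string.punctuation]
        docs.append((tokens, fid.split('/')[0]))
    return docs


def top_words(documents, n=100):
    """Return the n most frequent tokens across all documents."""
    counts = Counter(w for tokens, _ in documents for w in tokens)
    return [w for w, _ in counts.most_common(n)]


def featurize(tokens, word_features):
    """Map each feature word to True/False depending on whether it occurs in tokens."""
    present = set(tokens)
    return {w: (w in present) for w in word_features}


def train_and_evaluate(reader, stop, train_fraction=0.9, seed=0):
    """Train a Naive Bayes classifier and return (classifier, held-out accuracy)."""
    # Imported here so the helpers above stay usable without NLTK installed.
    from nltk.classify import NaiveBayesClassifier, accuracy

    documents = build_documents(reader, stop)
    random.Random(seed).shuffle(documents)  # assumption: shuffle so both labels appear in the test split
    word_features = top_words(documents)
    labelled = [(featurize(tokens, word_features), tag) for tokens, tag in documents]
    numtrain = int(len(labelled) * train_fraction)
    classifier = NaiveBayesClassifier.train(labelled[:numtrain])
    return classifier, accuracy(classifier, labelled[numtrain:])
```

Calling `classifier, acc = train_and_evaluate(mr, stop)` and then `classifier.show_most_informative_features(5)` should give output in the same form as in the question; the numbers will depend on your corpus.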