I am training the NaiveBayesClassifier
in Python using sentences, and it gives me the error below. I do not understand what the error might be, and any help would be good.
I have tried many other input formats, but the error remains. The code given below:
from text.classifiers import NaiveBayesClassifier
from text.blob import TextBlob
train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg') ]
test = [('The beer was good.', 'pos'),
('I do not enjoy my job', 'neg'),
("I ain't feeling dandy today.", 'neg'),
("I feel amazing!", 'pos'),
('Gary is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'neg') ]
classifier = nltk.NaiveBayesClassifier.train(train)
I am including the traceback below.
Traceback (most recent call last):
File "C:\Users\5460\Desktop\train01.py", line 15, in <module>
all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
File "C:\Users\5460\Desktop\train01.py", line 15, in <genexpr>
all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 87, in word_tokenize
return _word_tokenize(text)
File "C:\Python27\lib\site-packages\nltk\tokenize\treebank.py", line 67, in tokenize
text = re.sub(r'^\"', r'``', text)
File "C:\Python27\lib\re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
@275365's tutorial on the data structure for NLTK's bayesian classifier is great. From a more high level, we can look at it as,
We have inputs sentences with sentiment tags:
Let's consider our feature sets to be individual words, so we extract a list of all possible words from the training data (let's call it vocabulary) as such:
Essentially,
vocabulary
here is the same @275365'sall_word
From each data point, (i.e. each sentence and the pos/neg tag), we want to say whether a feature (i.e. a word from the vocabulary) exist or not.
But we also want to tell the classifier which word don't exist in the sentence but in the vocabulary, so for each data point, we list out all possible words in the vocabulary and say whether a word exist or not:
But since this loops through the vocabulary twice, it's more efficient to do this:
So for each sentence, we want to tell the classifier for each sentence which word exist and which word doesn't and also give it the pos/neg tag. We can call that a
feature_set
, it's a tuple made up of ax
(as shown above) and its tag.Then we feed these features and tags in the feature_set into the classifier to train it:
Now you have a trained classifier and when you want to tag a new sentence, you have to "featurize" the new sentence to see which of the word in the new sentence are in the vocabulary that the classifier was trained on:
NOTE: As you can see from the step above, the naive bayes classifier cannot handle out of vocabulary words since the
foobar
token disappears after you featurize it.Then you feed the featurized test sentence into the classifier and ask it to classify:
Hopefully this gives a clearer picture of how to feed data in to NLTK's naive bayes classifier for sentimental analysis. Here's the full code without the comments and the walkthrough:
It appears that you are trying to use TextBlob but are training the NLTK NaiveBayesClassifier, which, as pointed out in other answers, must be passed a dictionary of features.
TextBlob has a default feature extractor that indicates which words in the training set are included in the document (as demonstrated in the other answers). Therefore, TextBlob allows you to pass in your data as is.
Of course, the simple default extractor is not appropriate for all problems. If you would like to how features are extracted, you just write a function that takes a string of text as input and outputs the dictionary of features and pass that to the classifier.
I encourage you to check out the short TextBlob classifier tutorial here: http://textblob.readthedocs.org/en/latest/classifiers.html
You need to change your data structure. Here is your
train
list as it currently stands:The problem is, though, that the first element of each tuple should be a dictionary of features. So I will change your list into a data structure that the classifier can work with:
Your data should now be structured like this:
Note that the first element of each tuple is now a dictionary. Now that your data is in place and the first element of each tuple is a dictionary, you can train the classifier like so:
If you want to use the classifier, you can do it like this. First, you begin with a test sentence:
Then, you tokenize the sentence and figure out which words the sentence shares with all_words. These constitute the sentence's features.
Your features will now look like this:
Then you simply classify those features:
This test sentence appears to be positive.