naiveBayes and predict function not working in R

Posted 2019-09-11 02:31

I am doing sentiment analysis on Twitter comments (in the Kazakh language) using the R script below. The training set has 3,000 comments (1,500 happy, 1,500 sad) and the test set has 1,000 mixed happy/sad comments. Everything runs without errors, but at the end the predicted values are all happy, which cannot be right.

I have checked every function up to naiveBayes and they all work. I also checked the classifier's values and they look correct, so I think either naiveBayes or predict is going wrong.

When I used only one happy comment plus the 1,500 sad (negative) comments as the training set with the code below (note that mat[1500:3000,] keeps row 1500, which is the last happy comment), the predicted results are still all happy, when I think they should have been mostly sad.

classifier = naiveBayes(mat[1500:3000,], as.factor(sentiment_all[1500:3000]))

However, when I used only the sad (negative) comments as the training set, the predicted results are all sad.

classifier = naiveBayes(mat[1501:3000,], as.factor(sentiment_all[1501:3000]))

I have spent hours on this and I am completely lost as to where the problem is. Please help me solve this issue.

Here is the script:

setwd("Path")
happy = readLines("Path")
sad = readLines("Path")
happy_test = readLines("Path")
sad_test = readLines("Path")

tweet = c(happy, sad)
tweet_test= c(happy_test, sad_test)
tweet_all = c(tweet, tweet_test)
sentiment = c(rep("happy", length(happy) ), 
              rep("sad", length(sad)))
sentiment_test = c(rep("happy", length(happy_test) ), 
                   rep("sad", length(sad_test)))
sentiment_all = as.factor(c(sentiment, sentiment_test))

library(RTextTools)
library(e1071)

# naive bayes
# weighting must be passed by name; an unnamed tm::weightTfIdf would be
# matched positionally to the wrong argument (minDocFreq)
mat = create_matrix(tweet_all, language="kazakh", 
                    removeStopwords=FALSE, removeNumbers=TRUE, 
                    stemWords=FALSE, weighting=tm::weightTfIdf)

mat = as.matrix(mat)

classifier = naiveBayes(mat[1:3000,], as.factor(sentiment_all[1:3000]))
predicted = predict(classifier, mat[3001:4000,]); predicted
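One likely culprit, independent of class balance: e1071's naiveBayes() treats every numeric column as a Gaussian variable, and on a sparse tf-idf matrix that is mostly zeros the fitted means and variances degenerate, which often yields one-class predictions. A common workaround is to recode each term column as a "Yes"/"No" presence indicator so the classifier builds categorical tables instead. A minimal sketch, with a toy matrix standing in for `mat` (the `convert_counts` helper and all data here are illustrative, not part of the original script):

```r
library(e1071)

# Toy term-weight matrix: 8 documents x 5 terms, mostly zeros,
# mimicking a sparse tf-idf document-term matrix.
set.seed(1)
mat_toy <- matrix(runif(40) * rbinom(40, 1, 0.4), nrow = 8,
                  dimnames = list(NULL, paste0("term", 1:5)))
labels <- factor(rep(c("happy", "sad"), each = 4))

# Recode numeric weights as presence/absence so naiveBayes() uses
# categorical tables rather than Gaussian estimates per term.
convert_counts <- function(x) ifelse(x > 0, "Yes", "No")
mat_bin <- apply(mat_toy, 2, convert_counts)

classifier <- naiveBayes(mat_bin, labels, laplace = 1)
predicted  <- predict(classifier, mat_bin)
```

The same `apply(mat, 2, convert_counts)` call could be run on the real matrix before splitting it into the 1:3000 training and 3001:4000 test rows.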

1 Answer

Animai°情兽 · 2019-09-11 03:11

Your issue is quite basic: you are setting the problem up wrong. Ideally you want roughly a 50-50 split of positives and negatives in your training data, because Naive Bayes estimates the class priors from the training proportions, and a heavily skewed prior dominates the predictions.
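A balanced training set can be drawn by sampling the same number of row indices from each class. A small sketch (the `sentiment_all` vector is rebuilt here with assumed class counts purely to make the snippet self-contained):

```r
# Assumed stand-in for the question's label vector: 2000 happy + 2000 sad.
set.seed(42)
sentiment_all <- factor(rep(c("happy", "sad"), times = c(2000, 2000)))

# Sample an equal number of indices per class for a 50-50 training split.
n_per_class <- 1500
train_idx <- c(sample(which(sentiment_all == "happy"), n_per_class),
               sample(which(sentiment_all == "sad"),   n_per_class))
table(sentiment_all[train_idx])
```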

I am guessing that in your case with only one happy comment, the handful of terms from that single comment act as near-perfect predictors, so the classifier could separate the classes on them very easily.

Where you use absolutely no positive comments, you are telling the model that the only possible outcome is "sad", and that is exactly what it predicts.
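This collapse is easy to reproduce: a naiveBayes model fitted on a single class gives the unseen class a prior of zero, so no amount of evidence can ever flip a prediction. A tiny illustrative example (data made up):

```r
library(e1071)

# Training labels contain only "sad"; "happy" exists as a factor level
# but has a prior count of zero, so its posterior is always zero too.
x <- data.frame(term = factor(c("Yes", "No", "Yes", "No")))
y <- factor(rep("sad", 4), levels = c("happy", "sad"))

m <- naiveBayes(x, y, laplace = 1)
m$apriori              # happy: 0, sad: 4
predict(m, x)          # every prediction is "sad"
```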

As for your main issue, try a different data set. Where are you getting your tweets from, and are they sufficiently diverse?
