Naive Bayes in Quanteda vs caret: wildly different

I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the build-in naive bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to work right.

Here is some code for reproduction. First on the quanteda side:

library(quanteda)
library(quanteda.corpora)
library(caret)
corp <- data_corpus_movies
set.seed(300)
id_train <- sample(docnames(corp), size = 1500, replace = FALSE)

# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE)

# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE) %>% 
  dfm_select(pattern = training_dfm, 
             selection = "keep")

# train model on sentiment
nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))

# predict and evaluate
actual_class <- docvars(test_dfm, "Sentiment")
predicted_class <- predict(nb_quanteda, newdata = test_dfm)
class_table_quanteda <- table(actual_class, predicted_class)
class_table_quanteda
#>             predicted_class
#> actual_class neg pos
#>          neg 202  47
#>          pos  49 202

Not bad. The accuracy is 80.8% percent without tuning. Now the same (as far as I know) in caret

training_m <- convert(training_dfm, to = "matrix")
test_m <- convert(test_dfm, to = "matrix")
nb_caret <- train(x = training_m,
                  y = as.factor(docvars(training_dfm, "Sentiment")),
                  method = "naive_bayes",
                  trControl = trainControl(method = "none"),
                  tuneGrid = data.frame(laplace = 1,
                                        usekernel = FALSE,
                                        adjust = FALSE),
                  verbose = TRUE)

predicted_class_caret <- predict(nb_caret, newdata = test_m)
class_table_caret <- table(actual_class, predicted_class_caret)
class_table_caret
#>             predicted_class_caret
#> actual_class neg pos
#>          neg 246   3
#>          pos 249   2

Not only is the accuracy abysmal here (49.6% - roughly chance), the pos class is hardly ever predicted at all! So I'm pretty sure I'm missing something crucial here, as I would assume the implementations should be fairly similar, but not sure what.

I already looked at the source code for the quanteda function (hoping that it might be built on caret or the underlying package anyway) and saw that there is some weighting and smoothing going on. If I apply the same to my dfm before training (setting laplace = 0 later on), accuracy is a bit better. Yet also only 53%.

标签： r r-caret text-classification supervised-learning quanteda

2条回答

Fickle 薄情

2楼-- · 2019-05-05 06:35

The answer is that caret (which uses naive_bayes from the naivebayes package) assumes a Gaussian distribution, whereas quanteda::textmodel_nb() is based on a more text-appropriate multinomial distribution (with the option of a Bernoulli distribution as well).

The documentation for textmodel_nb() replicates the example from the IIR book (Manning, Raghavan, and Schütze 2008) and a further example from Jurafsky and Martin (2018) is also referenced. See:

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. An Introduction to Information Retrieval. Cambridge University Press (Chapter 13). https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Jurafsky, Daniel, and James H. Martin. 2018. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of 3rd edition, September 23, 2018 (Chapter 4). https://web.stanford.edu/~jurafsky/slp3/4.pdf

Another package, e1071, produces the same results you found as it is also based on a Gaussian distribution.

library("e1071")
nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
table(actual_class, nb_e1071_pred)
##             nb_e1071_pred
## actual_class neg pos
##          neg 246   3
##          pos 249   2

However both caret and e1071 work on dense matrices, which is one reason they are so mind-numbingly slow compared to the quanteda approach which operates on the sparse dfm. So from the standpoint of appropriateness, efficiency, and (as per your results) the performance of the classifier, it should be pretty clear which one is preferred!

library("rbenchmark")
benchmark(
    quanteda = { 
        nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
        predicted_class <- predict(nb_quanteda, newdata = test_dfm)
    },
    caret = {
        nb_caret <- train(x = training_m,
                          y = as.factor(docvars(training_dfm, "Sentiment")),
                          method = "naive_bayes",
                          trControl = trainControl(method = "none"),
                          tuneGrid = data.frame(laplace = 1,
                                                usekernel = FALSE,
                                                adjust = FALSE),
                          verbose = FALSE)
        predicted_class_caret <- predict(nb_caret, newdata = test_m)
    },
    e1071 = {
        nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
        nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
    },
    replications = 1
)
##       test replications elapsed relative user.self sys.self user.child sys.child
## 2    caret            1  29.042  123.583    25.896    3.095          0         0
## 3    e1071            1 217.177  924.157   215.587    1.169          0         0
## 1 quanteda            1   0.235    1.000     0.213    0.023          0         0

0人赞添加讨论(0) 举报

The star\"

3楼-- · 2019-05-05 06:49

The above answer is correct, I just wanted to add that you can use a Bernoulli distribution with both the 'naivebayes' and 'e1071' package by turning your variables into factors. The output of these should match the 'quanteda' textmodel_nb with a Bernoulli distribution.

Moreover, you could check out: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. This implements a Bernoulli, Multinomial, and Gaussian distribution, works with sparse matrices and is blazingly fast (Fastest currently on CRAN).

0人赞添加讨论(0) 举报

Naive Bayes in Quanteda vs caret: wildly different

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间