Quanteda package, Naive Bayes: How can I predict on new data with different features?

Published 2019-04-13 11:15

Question:

I used quanteda::textmodel_NB to create a model that categorizes text into one of two categories. I fit the model on a training set of data from last summer.

Now, I am trying to use it this summer to categorize new text we get here at work. I tried doing this and got the following error:

Error in predict.textmodel_NB_fitted(model, test_dfm) : 
feature set in newdata different from that in training set

The code in the function that generates the error can be found here at lines 157 to 165.

I assume this occurs because the words in the training data set do not exactly match the words used in the test data set. But why does this error occur? To be useful in real-world applications, the model should surely be able to handle data sets that contain different features, since that is what will almost always happen in applied use.
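The requirement itself is not specific to quanteda: a fitted naive Bayes model stores one likelihood column per training feature, so the test matrix must present exactly those columns, in the same order. A minimal base-R sketch of the alignment that has to happen, using made-up feature names:

```r
# Hypothetical document-feature counts: one row per document, one column per feature
train_m <- matrix(c(1, 0, 2, 1), nrow = 1,
                  dimnames = list("doc1", c("anova", "test", "rank", "null")))
test_m  <- matrix(c(1, 3), nrow = 1,
                  dimnames = list("docA", c("test", "comma")))

# Align the test matrix to the training feature set:
# keep shared features, fill absent ones with 0, drop unseen ones ("comma")
aligned <- matrix(0, nrow = nrow(test_m), ncol = ncol(train_m),
                  dimnames = list(rownames(test_m), colnames(train_m)))
shared  <- intersect(colnames(train_m), colnames(test_m))
aligned[, shared] <- test_m[, shared]
aligned
##      anova test rank null
## docA     0    1    0    0
```

Features unseen in training ("comma" above) carry no information for the model and are dropped; training features absent from the test document are simply zero counts.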

So my first question is:

1. Is this error a property of the naive Bayes algorithm, or was it a design choice by the function's author?

Which then leads me to my second question:

2. How can I remedy this issue?

To get at this second question, I provide reproducible code (the last line generates the error above):

library(quanteda)
library(magrittr)
library(data.table)

train_text <- c("Can random effects apply only to categorical variables?",
                "ANOVA expectation identity",
                "Statistical test for significance in ranking positions",
                "Is Fisher Sharp Null Hypothesis testable?",
                "List major reasons for different results from survival analysis among different studies",
                "How do the tenses and aspects in English correspond temporally to one another?",
                "Is there a correct gender-neutral singular pronoun (“his” vs. “her” vs. “their”)?",
                "Are collective nouns always plural, or are certain ones singular?",
                "What’s the rule for using “who” and “whom” correctly?",
                "When is a gerund supposed to be preceded by a possessive adjective/determiner?")

train_class <- factor(c(rep(0,5), rep(1,5)))

train_dfm <- train_text %>% 
  dfm(tolower = TRUE, stem = TRUE, remove = stopwords("english"))

model <- textmodel_NB(train_dfm, train_class)

test_text <- c("Weighted Linear Regression with Proportional Standard Deviations in R",
               "What do significance tests for adjusted means tell us?",
               "How should I punctuate around quotes?",
               "Should I put a comma before the last item in a list?")

test_dfm <- test_text %>% 
  dfm(tolower = TRUE, stem = TRUE, remove = stopwords("english"))

predict(model, test_dfm)
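To see the mismatch concretely before calling predict(), you can compare the two feature sets directly; featnames() is quanteda's accessor for a dfm's feature (column) names:

```r
# Features in the test dfm that the model never saw during training
setdiff(featnames(test_dfm), featnames(train_dfm))

# Training features that the test dfm lacks (these need zero columns)
setdiff(featnames(train_dfm), featnames(test_dfm))
```

Both calls assume train_dfm and test_dfm from the reproducible example above are in scope.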

The only remedy I could think of was to make the features identical manually (I assumed this would fill in 0 for features that are not present in the object), but this generated a new error. The code, continuing the example above, is:

model_features <- model$data$x@Dimnames$features # gets the features of the training data

test_features <- test_dfm@Dimnames$features # gets the features of the test data

all_features <- c(model_features, test_features) %>% # combining the two sets of features...
  subset(!duplicated(.)) # ...and getting rid of duplicate features

model$data$x@Dimnames$features <- test_dfm@Dimnames$features <- all_features # replacing features of model and test_dfm with all_features

predict(model, dfm) # generates the error below

However, this code generates a new error (note that predict(model, dfm) passes dfm, the function, rather than test_dfm, which is why ncol(newdata) comes back empty; and even with test_dfm passed, overwriting @Dimnames only relabels existing columns without adding new ones, so the original feature-mismatch error would return):

Error in if (ncol(object$PcGw) != ncol(newdata)) stop("feature set in newdata different from that in training set") : 
  argument is of length zero

How do I apply this naive Bayes model to a new data set with different features?

Answer 1:

Fortunately there is an easy way to do this: use dfm_select() on your test data to give it the identical features (and feature ordering) of the training set. It's this simple:

test_dfm <- dfm_select(test_dfm, train_dfm)
predict(model, test_dfm)
## Predicted textmodel of type: Naive Bayes
## 
##             lp(0)       lp(1)     Pr(0)  Pr(1) Predicted
## text1  -0.6931472  -0.6931472    0.5000 0.5000         0
## text2 -11.8698712 -13.1879095    0.7889 0.2111         0
## text3  -4.1484118  -3.6635616    0.3811 0.6189         1
## text4  -8.0091415  -8.4257356    0.6027 0.3973         0
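A note for newer quanteda releases (an assumption about the installed version, since the question uses the pre-1.0 API): textmodel_nb() has moved to the quanteda.textmodels package, dfm() is built from a tokens object rather than raw text, and dfm_match() is the dedicated function for conforming a test dfm to a training feature set. A sketch of the same workflow, reusing train_text, test_text, and train_class from the question:

```r
library(quanteda)             # tokens(), dfm(), dfm_match(), featnames()
library(quanteda.textmodels)  # textmodel_nb() lives here in quanteda >= 2.0
library(magrittr)

train_dfm <- train_text %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem() %>%
  dfm()

model <- textmodel_nb(train_dfm, train_class)

test_dfm <- test_text %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem() %>%
  dfm() %>%
  dfm_match(features = featnames(train_dfm))  # conform columns to the training set

predict(model, newdata = test_dfm)
```

dfm_match() pads missing training features with zero-count columns and drops features absent from the training set, which is exactly the alignment predict() expects.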