Text2Vec classification with caret SVM warning messages

Posted 2019-08-28 04:49

Question:

I am working on a text classification problem with the text2vec package and caret. I use text2vec to build a document-term matrix and then fit different models with caret. The goal is to identify the similarity between two strings, using labeled training data.

However, when training a linear SVM model, I get a number of warning messages, excerpt below:

Warning messages:
1: In svm.default(x = as.matrix(x), y = y, kernel = "linear", ... :
  Variable(s) ‘influenza’ and ‘perindoprilindapamide’ and ‘bisoprololhct.1’ and ‘creon.1’ and ‘kreon.1’ and ‘paratramadol.1’ constant. Cannot scale data.

Can you please help me understand these warnings and how to address "Cannot scale data"?

An excerpt of the original Training Data:

ID          MAKTX_Keyword       PH_Level_04_Keyword   Result 
266325638   AMLODIPINE          AMLODIPINE              0 
724712821   IRBESARTANHCTZ      IRBESARTANHCTZ          0 
567428641   RABEPRAZOLE         RABEPRAZOLE             0 
137472217   MIRTAZAPINE         MIRTAZAPINE             0 
175827784   FONDAPARINUX        ARIXTRA                 1 
456372747   VANCOMYCIN          VANCOMYCIN              0 
653832438   BRUFEN              IBUPROFEN               1 
917575539   POTASSIUM           POTASSIUM               0     
222949123   DIOSMINHESPERIDIN   DIOSMINHESPERIDIN       0 
892725684   IBUPROFEN           IBUPROFEN               0

Code to build SVM Model:

control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)

Train_PRDHA_String.df$Result <- ifelse(Train_PRDHA_String.df$Result == 1, "X", "Y")

(warn=1)
(warnings=2)

t1 = Sys.time()
svm_Linear <- train(x = as.matrix(dtm_train), y = as.factor(Train_PRDHA_String.df$Result),
                    method = "svmLinear2",
                    trControl=control,
                    tuneLength = 5,
                    metric ="Accuracy")
print(difftime(Sys.time(), t1, units = 'sec'))

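Answer 1:

The warning means that, within some of the resamples, these variables have only one unique value (they are constant), so they cannot be scaled. You can pass preProc = "zv" to train() to drop such zero-variance predictors in each resample, which gets rid of the warning.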
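To illustrate why a column can become constant only after resampling, here is a minimal base R sketch. The toy matrix, the term names and the single hand-picked split are assumptions made for illustration; they are not the asker's data, and the filter is written by hand to mimic the idea behind "zv" rather than calling caret itself.

```r
# Toy document-term matrix (hypothetical): 6 documents, 3 terms.
# "influenza" occurs in only one document, as rare terms do in a real DTM.
dtm <- matrix(
  c(1, 0, 1, 0, 1, 0,   # amlodipine
    0, 1, 0, 1, 0, 1,   # ibuprofen
    0, 0, 0, 0, 0, 1),  # influenza (rare term)
  ncol = 3,
  dimnames = list(NULL, c("amlodipine", "ibuprofen", "influenza"))
)

# Simulate one resample: document 6 is held out, documents 1-5 are used for fitting.
train_rows <- 1:5
fold <- dtm[train_rows, , drop = FALSE]

# A column has zero variance when it contains a single unique value.
is_constant <- apply(fold, 2, function(col) length(unique(col)) == 1)
print(is_constant)

# Scaling divides by the standard deviation, which is 0 for a constant column.
print(sd(fold[, "influenza"]))

# Dropping constant columns within the resample is the idea behind the "zv" filter.
fold_zv <- fold[, !is_constant, drop = FALSE]
print(colnames(fold_zv))
```

In the full matrix "influenza" is not constant, which is why the warning only shows up during cross-validation. In the question's code, the corresponding change would be adding preProc = "zv" to the existing train() call and leaving the other arguments as they are.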