Predicting text data labels in test data set with

Posted 2019-04-15 03:02

Question:

I am using the Weka GUI to train an SVM classifier (using LibSVM) on a dataset. The .arff file looks like this:

@relation Expandtext

@attribute message string 
@attribute Class {positive, negative, objective}

@data

I turn it into a bag of words with StringToWordVector, run the SVM, and get a decent classification rate. Now I have test data whose labels I do not know and want to predict. Its header information is the same, but the class of every instance is given as a question mark (?), i.e.

'Musical awareness: Great Big Beautiful Tomorrow has an ending\u002c Now is the time does not', ?

Again I pre-processed it with StringToWordVector, and the class attribute is in the same position as in the training data.

I go to the "classify" menu, load my trained SVM model, select "supplied test data", load the test data, then right-click the model and choose "Re-evaluate model on current test set". It gives me an error saying that the test and training sets are not compatible. I am not sure why.

Am I going about this the wrong way to label the test data? What am I doing wrong?

Answer 1:

For almost any machine learning algorithm, the training data and the test data need to have the same format. That means both must have the same features (attributes, in Weka terms), in the same order and of the same type, including the class.

The problem is probably that you pre-process the training set and the test set independently, so the StringToWordVector filter creates different features for each set. Hence the model, trained on the training set, is incompatible with the test set.

What you want instead is to initialize the filter on the training set and then apply it to both the training set and the test set.
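
To make the mismatch concrete, here is a minimal sketch in plain Java with no Weka dependency. The class `BagOfWordsSketch` and its methods `fitVocabulary` and `toVector` are hypothetical names made up for illustration, not Weka's API, and splitting on whitespace is a simplifying assumption. It only shows the idea: a vocabulary fitted separately on each set yields different attributes, whereas a vocabulary fitted once on the training texts and reused for the test texts yields vectors of the same shape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy stand-in for a word-vector filter; this is NOT Weka's StringToWordVector.
class BagOfWordsSketch {

    // Collect the distinct lower-cased words of the given texts, in first-seen order.
    static List<String> fitVocabulary(List<String> texts) {
        Set<String> words = new LinkedHashSet<>();
        for (String text : texts) {
            for (String w : text.toLowerCase().trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
        }
        return new ArrayList<>(words);
    }

    // Count occurrences of each vocabulary word; words outside the vocabulary are ignored.
    static int[] toVector(String text, List<String> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            int idx = vocabulary.indexOf(w);
            if (idx >= 0) {
                counts[idx]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("great beautiful tomorrow", "bad ending");
        List<String> test = Arrays.asList("great ending today");

        // Fitting separately gives each set its own, incompatible attribute list.
        System.out.println("train vocabulary: " + fitVocabulary(train));
        System.out.println("test vocabulary:  " + fitVocabulary(test));

        // Fitting once on train and reusing the result keeps the attributes identical.
        List<String> vocabulary = fitVocabulary(train);
        System.out.println("test vector: "
                + Arrays.toString(toVector(test.get(0), vocabulary)));
    }
}
```

Words that appear only in the test text ("today" here) are simply dropped, which is the price of keeping the attribute set fixed to what the model saw during training.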

The question "Weka: ReplaceMissingValues for a test file" deals with this issue, but I'll repeat the relevant part here:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

Instances train = ...   // from somewhere
Instances test = ...    // from somewhere
Filter filter = new StringToWordVector(); // could be any filter
filter.setInputFormat(train);  // initialize the filter once, with the training set
Instances newTrain = Filter.useFilter(train, filter);  // configures the filter based on the training instances and returns the filtered instances
Instances newTest = Filter.useFilter(test, filter);    // creates the new test set with the same attributes

Now you can train the SVM on newTrain and apply the resulting model to newTest.

If training and testing have to happen in separate runs or programs, it should be possible to serialize the initialized filter together with the model. When you load (deserialize) the model, you can also load the filter and apply it to the test data. The two should then be compatible.
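
The save-and-reload idea can be sketched with standard Java serialization. This is a plain-Java illustration under stated assumptions, not Weka code: `SavedPipelineSketch` is a hypothetical class, its `vocabulary` field stands in for the initialized filter, and its `weights` array stands in for the trained classifier. A byte array is used in place of a file to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical bundle: fitted preprocessing state plus the trained model.
class SavedPipelineSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    final ArrayList<String> vocabulary; // stands in for the initialized filter
    final double[] weights;             // stands in for the trained classifier

    SavedPipelineSketch(List<String> vocabulary, double[] weights) {
        this.vocabulary = new ArrayList<>(vocabulary);
        this.weights = weights.clone();
    }

    // Serialize the whole bundle, so filter state and model travel together.
    static byte[] save(SavedPipelineSketch pipeline) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(pipeline);
        }
        return bytes.toByteArray();
    }

    // Restore the bundle in a later run.
    static SavedPipelineSketch load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SavedPipelineSketch) in.readObject();
        }
    }

    // Score a text with the stored vocabulary; unknown words are ignored.
    double score(String text) {
        double sum = 0.0;
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            int idx = vocabulary.indexOf(w);
            if (idx >= 0) {
                sum += weights[idx];
            }
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        SavedPipelineSketch trained = new SavedPipelineSketch(
                Arrays.asList("great", "bad"), new double[] {1.0, -1.0});
        byte[] stored = save(trained);               // end of the training run
        SavedPipelineSketch restored = load(stored); // start of the prediction run
        System.out.println(restored.score("great great bad ending"));
    }
}
```

Bundling both pieces in one serialized object is a design choice that makes it hard to load a model with a mismatched preprocessing step by accident; storing them as two separate files would also work as long as they are always loaded as a pair.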