I'm trying to build an SMS spam classifier using the WEKA library. I have a CSV file with "label" and "text" column headings. When I run the code below, it creates an ARFF file with two attributes:
@attribute label {ham,spam}
@attribute text {'Go until jurong point','Ok lar...', etc.}
The text attribute is being written as a nominal attribute whose set of allowed values is the text of every message. I need it to be a String attribute instead, not an enumeration of all the text from all instances, so that I can use the StringToWordVector filter when training a classifier.
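import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

// load CSV (args[0] is the input CSV path)
CSVLoader loader = new CSVLoader();
loader.setSource(new File(args[0]));
Instances data = loader.getDataSet();
// save ARFF (args[1] is the output ARFF path)
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File(args[1]));
saver.setDestination(new File(args[1]));
saver.writeBatch();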
I know I can create a String attribute like this:
Attribute tmp = new Attribute("tmp", (FastVector) null);
But I don't know how to replace the current attribute, or set the attribute type before reading in the CSV.
I tried inserting a new String attribute and deleting the current nominal attribute, but this deletes all of the SMS text along with it. I also tried using renameAttributeValue, but this doesn't seem to work for changing the attribute type.
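On setting the attribute type before reading the CSV: CSVLoader has a setStringAttributes option (the -S flag on the command line) that takes an attribute range and forces those columns to be loaded as String. The following is a minimal sketch, assuming the text column is the last one in the file; the class name CsvTextAsString and the helper load are made up for illustration.

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class CsvTextAsString {

    // Loads a CSV file, forcing the last column to be a String attribute.
    // Assumption: the "text" column is the last column of the CSV.
    public static Instances load(File csv) throws Exception {
        CSVLoader loader = new CSVLoader();
        // Set the range before the source so it applies when the header is read.
        loader.setStringAttributes("last");
        loader.setSource(csv);
        return loader.getDataSet();
    }

    public static void main(String[] args) throws Exception {
        Instances data = load(new File(args[0]));
        // Reports whether the "text" attribute came out as a String attribute.
        System.out.println("text is String: " + data.attribute("text").isString());
    }
}
```

The label column is not named in the range, so it should still be loaded as nominal, and the resulting Instances can be passed to ArffSaver exactly as in the code above.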
EDIT: I suspect that this NominalToString filter will do the job, but I'm not sure how to use it.
Any suggestions would be much appreciated. Thanks!
This did the trick. It changed the text attribute type, but not the label attribute type (though I'm not sure why it did one but not the other).
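For reference, a minimal sketch of applying NominalToString to an already-loaded dataset. As far as I know, the filter's -C option defaults to "last", which would explain why only the text attribute was converted and label was left as nominal. The class name TextToString and the helper convert are made up for illustration, and the option is passed through setOptions because the setter name for the column has differed between WEKA versions.

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;

public class TextToString {

    // Converts the given nominal column (e.g. "last" or "2") to a String attribute.
    public static Instances convert(Instances data, String column) throws Exception {
        NominalToString filter = new NominalToString();
        // -C selects which attribute to convert.
        filter.setOptions(new String[] {"-C", column});
        // The input format must be set before the filter is applied.
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }
}
```

Calling TextToString.convert(data, "last") between loader.getDataSet() and saver.setInstances(...) should give an ARFF file in which text is declared as a string attribute, ready for StringToWordVector.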
There's a small tip here