I'm trying to build an SMS spam classifier using the WEKA library. I have a CSV file with "label" and "text" columns. When I use the code below, it creates an ARFF file with two attributes:
@attribute label {ham,spam}
@attribute text {'Go until jurong point','Ok lar...', etc.}
The text attribute is being written as a nominal attribute whose values are the full text of every message. I need it to be a String attribute instead, not an enumeration of all the text from all instances. With text as a String attribute, I can apply the StringToWordVector filter before training a classifier.
// load CSV
CSVLoader loader = new CSVLoader();
loader.setSource(new File(args[0]));
Instances data = loader.getDataSet();
// save ARFF
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File(args[1]));
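// as far as I can tell, setFile already sets the destination in recent WEKA versions, so this call may be redundant
saver.setDestination(new File(args[1]));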
saver.writeBatch();
I know I can create a String attribute like this:
Attribute tmp = new Attribute("tmp", (FastVector) null);
But I don't know how to replace the current attribute, or set the attribute type before reading in the CSV.
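One way to set the type before reading might be CSVLoader's own string-attribute range. The following is a minimal sketch, not a confirmed solution. It assumes a WEKA version whose CSVLoader provides setStringAttributes (the -S option) and that the text is the second column. The class name CsvToArffWithStringText and the helper loadWithStringColumn are made up for illustration:

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArffWithStringText {

    // Loads the CSV and asks the loader to treat the given column range
    // (1-based, e.g. "2" or "last") as String attributes instead of nominal.
    static Instances loadWithStringColumn(File csv, String stringRange) throws Exception {
        CSVLoader loader = new CSVLoader();
        // Assumption: setStringAttributes is available in the WEKA version in use.
        loader.setStringAttributes(stringRange);
        loader.setSource(csv);
        return loader.getDataSet();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: "label" is column 1 and "text" is column 2.
        Instances data = loadWithStringColumn(new File(args[0]), "2");

        // Save as ARFF, as in the original snippet.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File(args[1]));
        saver.writeBatch();
    }
}
```

If this works as expected, the ARFF header should declare text as a string attribute while label stays nominal.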
I tried inserting a new String attribute and deleting the current nominal attribute, but this deletes all of the SMS text along with it. I also tried using renameAttributeValue, but this doesn't seem to work for changing the attribute type.
EDIT: I suspect that the NominalToString filter (weka.filters.unsupervised.attribute.NominalToString) will do the job, but I'm not sure how to use it.
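For reference, a rough sketch of how NominalToString might be applied after loading. It assumes the filter accepts a -C option naming the column range to convert and that the text is in column 2. The class name NominalTextToString and the helper convertToString are hypothetical:

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;

public class NominalTextToString {

    // Returns a copy of the data in which the attributes in columnRange
    // (1-based, e.g. "2" or "last") are String attributes instead of nominal.
    static Instances convertToString(Instances data, String columnRange) throws Exception {
        NominalToString filter = new NominalToString();
        // Options are set before setInputFormat so the filter sees them
        // when it determines the output format.
        filter.setOptions(new String[] {"-C", columnRange});
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }
}
```

With the loading code from the question, this would be called as `data = NominalTextToString.convertToString(data, "2");` before passing data to ArffSaver or to StringToWordVector.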
Any suggestions would be much appreciated. Thanks!