I'm using the Weka machine learning library's Java API and I have the following code:
String html = "repeat repeat repeat";
Attribute input = new Attribute("html",(FastVector) null);
FastVector inputVec = new FastVector();
inputVec.addElement(input);
Instances htmlInst = new Instances("html",inputVec,1);
htmlInst.add(new Instance(1));
htmlInst.instance(0).setValue(0, html);
StringToWordVector filter = new StringToWordVector();
filter.setUseStoplist(true);
filter.setInputFormat(htmlInst);
Instances dataFiltered = Filter.useFilter(htmlInst, filter);
Instance last = dataFiltered.lastInstance();
System.out.println(last);
StringToWordVector is supposed to count word occurrences within the string, but instead of the word 'repeat' being counted 3 times, the count comes out as only 1.
What am I doing wrong?
Gee... all those lines of code. How about these few lines instead?
Here's the code in action:
Output:
By default, StringToWordVector only reports the presence or absence of each word as 1 or 0. You must enable counting explicitly. Add:
filter.setOutputWordCounts(true);
and re-run.
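To make the difference between the two output modes concrete, here is a small plain-Java sketch. It does not use Weka and is not how StringToWordVector is implemented; the class WordVectorSketch and the helper vectorize are hypothetical names, and splitting on whitespace is a simplifying assumption. It only mimics the presence/absence versus word-count behaviour described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordVectorSketch {

    // Hypothetical helper (not part of Weka): maps each whitespace-separated
    // token to 1 (presence only) or to its number of occurrences.
    static Map<String, Integer> vectorize(String text, boolean outputWordCounts) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by blank input
            }
            if (outputWordCounts) {
                vector.merge(token, 1, Integer::sum); // accumulate occurrences
            } else {
                vector.put(token, 1); // record presence only
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String html = "repeat repeat repeat";
        System.out.println(vectorize(html, false)); // prints {repeat=1}
        System.out.println(vectorize(html, true));  // prints {repeat=3}
    }
}
```

The first call corresponds to what the question observes with the default settings; the second corresponds to what you should expect after calling setOutputWordCounts(true) on the filter.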
Weka has a dedicated mailing list; posting questions like this there might get you faster responses.