Java, Weka: How to predict numeric attribute?

Posted 2020-07-11 05:20

Question:

I was trying to use the NaiveBayesUpdateable classifier from Weka. My data contains both nominal and numeric attributes:

  @relation cars
  @attribute country {FR, UK, ...}
  @attribute city {London, Paris, ...}
  @attribute car_make {Toyota, BMW, ...}
  @attribute price numeric   %% car price 
  @attribute sales numeric   %% number of cars sold

I need to predict the number of sales (numeric!) based on other attributes.

I understand that I cannot use a numeric attribute as the class attribute for Bayes classification in Weka. One technique is to split the range of the numeric attribute into N intervals of length k and use a nominal attribute instead, where each interval index n is a class name, like this: @attribute class {1,2,3,...N}.

Yet the numeric attribute that I need to predict ranges from 0 to 1,000,000. Creating 1,000,000 classes makes no sense at all. How can I predict a numeric attribute with Weka, or what algorithms should I look for in case Weka has no tools for this task?

Answer 1:

What you want to do is regression, not classification. The difference is exactly what you describe/want:

  • Classification has discrete classes/labels; any nominal attribute can be used as the class here.
  • Regression has continuous labels; "classes" would be the wrong term here.

Most regression-based techniques can be turned into a binary classification by defining a threshold: the class is determined by whether the predicted value is above or below that threshold.

I don't know all of WEKA's classifiers that offer regression, but you can start by looking at these two:
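The thresholding idea can be sketched in a few lines of plain Java. This is only an illustration, not Weka code: the class name, the "high"/"low" labels, the cut-off of 1000 and the choice to count a value exactly at the threshold as "high" are all assumptions made for the example.

```java
public class ThresholdSketch {

    /**
     * Maps a numeric (regression) prediction to one of two class labels.
     * Assumption: a value exactly equal to the threshold counts as "high".
     */
    static String toBinaryClass(double predictedValue, double threshold) {
        return predictedValue >= threshold ? "high" : "low";
    }

    public static void main(String[] args) {
        // Hypothetical regression output: predicted number of cars sold.
        double predictedSales = 1250.0;
        // Hypothetical cut-off separating the two classes.
        double threshold = 1000.0;
        System.out.println(toBinaryClass(predictedSales, threshold)); // prints "high"
    }
}
```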

  • MultilayerPerceptron: Basically a neural network.
  • LinearRegression: As the name says, linear regression.

You might have to use the NominalToBinary filter to convert your nominal attributes to numerical (binary) ones.
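To make the nominal-to-binary conversion concrete, here is a minimal plain-Java sketch of the underlying idea (one 0/1 indicator per possible value). It does not call Weka and is not Weka's implementation; the filter's actual behaviour, for example for attributes with only two values, may differ. The class and method names are made up for illustration, and "Audi" is an invented third value, since the question elides the full list of car makes.

```java
import java.util.Arrays;
import java.util.List;

public class NominalToBinarySketch {

    /**
     * Encodes one nominal value as an array of 0/1 indicators,
     * one per possible value; exactly one entry is 1.0.
     */
    static double[] oneHot(List<String> possibleValues, String value) {
        int index = possibleValues.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown nominal value: " + value);
        }
        double[] indicators = new double[possibleValues.size()]; // all 0.0 initially
        indicators[index] = 1.0;
        return indicators;
    }

    public static void main(String[] args) {
        // Hypothetical value list for the car_make attribute.
        List<String> carMakes = Arrays.asList("Toyota", "BMW", "Audi");
        System.out.println(Arrays.toString(oneHot(carMakes, "BMW"))); // prints [0.0, 1.0, 0.0]
    }
}
```

A regression model then sees each indicator as an ordinary numeric input, which is why the conversion lets nominal attributes such as country or car_make be used alongside price.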



Answer 2:

You can find regression in Weka under classifiers > functions > LinearRegression. Here is an example of creating a regression model in Weka: https://www.ibm.com/developerworks/opensource/library/os-weka1/



Answer 3:

These days RandomForest would work just as you want it to (I believe this was first introduced in Weka 3.7). The features can be a mix of nominal and numeric, and the prediction is allowed to be numeric as well.

The drawback (I would imagine, in your case) is that it is not an updateable classifier, whereas NaiveBayesUpdateable works well with large amounts of data that may not fit in memory all at once.