Hi I am trying to implement RandomForest algorithm using Apache Spark MLLib. I have the dataset in the csv format with the following features
DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO
I want to use RandomForest MLlib and do prediction on last field Action and I want response as YES/NO.
I am following code from github to create RandomForest model. Since I have all categorical features except one int feature I have used the following code to convert them into JavaRDD<LabeledPoint>
please let me know incase its wrong
// Load and parse the data file.
JavaRDD<String> data = jsc.textFile("/tmp/xyz/data/training-dataset.csv");
// I have 14 features so giving 14 as arg to the following
final HashingTF tf = new HashingTF(14);
// Create LabeledPoint datasets for Actionable and nonactionable
JavaRDD<LabeledPoint> labledData = data.map(new Function<String, LabeledPoint>() {
@Override public LabeledPoint call(String alert) {
List<String> featureList = Arrays.asList(alert.trim().split(","));
String actionType = featureList.get(featureList.size() - 1).toLowerCase();
return new LabeledPoint(actionType.equals("YES")? 1 : 0, tf.transform(featureList));
}
});
Similarly above I create testdata and use in the following code to do prediction
JavaPairRDD<Double, Double> predictionAndLabel =
testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
@Override
public Tuple2<Double, Double> call(LabeledPoint p) {
return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
}
});
How do I get prediction based on my last field Action and prediction should come as YES/NO? Current predict method returns double not able to understand how do I implement it? Also am I following the correct approach of categorical feature into LabledPoint
please guide I am new to machine learning and Spark MLlib.