I built an H2O model in R and saved the POJO code. I want to score parquet files in HDFS using the POJO, but I'm not sure how to go about it. I plan on reading the parquet files into Spark (Scala/SparkR/PySpark) and scoring them there. Below is the excerpt I found on H2O's documentation page:
"How do I run a POJO on a Spark Cluster?
The POJO provides just the math logic to do predictions, so you won’t find any Spark (or even H2O) specific code there. If you want to use the POJO to make predictions on a dataset in Spark, create a map to call the POJO for each row and save the result to a new column, row-by-row"
Does anyone have some example code of how I can do this? I'd greatly appreciate any assistance. I code primarily in R and SparkR, and I'm not sure how I can "map" the POJO to each line.
Thanks in advance.
I just posted a solution that actually uses DataFrame/Dataset. The post used a Star Wars dataset to build a model in R and then scored the MOJO on the test set in Spark. I'll paste only the relevant part here:
Scoring with Spark (and Scala)
You can use either spark-submit or spark-shell. If you use spark-submit, h2o-genmodel.jar needs to be placed in the lib folder of your Spark application's root directory so it can be added as a dependency during compilation. The following code assumes you're running spark-shell. To use h2o-genmodel.jar there, append the jar when launching spark-shell with the --jars flag. For example:
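```bash
# Launch spark-shell with h2o-genmodel.jar on the classpath (adjust the path
# to wherever your h2o-genmodel.jar lives)
spark-shell --jars /path/to/h2o-genmodel.jar
```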
Now, in the Spark shell, import the dependencies:
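```scala
// Imports from h2o-genmodel.jar; BinomialModelPrediction assumes a binomial model
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import hex.genmodel.easy.prediction.BinomialModelPrediction
import hex.genmodel.MojoModel
```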
Using DataFrame
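The original snippet isn't reproduced here, so this is a minimal sketch of the pattern; the model path and the column names (name, height, mass) are assumptions based on the Star Wars test set, and df is assumed to be the test DataFrame loaded earlier:

```scala
// Wrap the MOJO in the easy-predict API (the model path is a placeholder)
val model = new EasyPredictModelWrapper(MojoModel.load("/path/to/model.zip"))

// Build a RowData per row, score it, and keep the probability of level 1
val scoredDf = df.map { x =>
  val r = new RowData
  r.put("height", x.getAs[Any]("height").toString)
  r.put("mass", x.getAs[Any]("mass").toString)
  val score = model.predictBinomial(r).classProbabilities
  (x.getAs[String]("name"), score(1))
}.toDF("name", "score") // rename the default "_1", "_2" columns
```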
The variable score holds the class probabilities for levels 0 and 1; score(1) is the probability of level 1, which is "human" here. By default, the map returns a DataFrame with placeholder column names "_1", "_2", and so on. You can rename the columns by calling toDF.
Using Dataset
To use the Dataset API we just need to create two case classes, one for the input data, and one for the output.
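A sketch under the same assumptions; note that the fields fed to the model are declared as String:

```scala
// One case class for the input rows and one for the scored output
case class StarWars(name: String, height: String, mass: String)
case class Score(name: String, score: Double)

val ds = df.as[StarWars]
val scoredDs = ds.map { x =>
  val r = new RowData
  r.put("height", x.height) // fields are read directly as x.columnName
  r.put("mass", x.mass)
  Score(x.name, model.predictBinomial(r).classProbabilities(1))
}
```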
With a Dataset you can get the value of a column by calling x.columnName directly. Just note that the column values fed to RowData have to be of type String, so you may need to cast them manually if the case class defines them with other types.
If you want to perform scoring with a POJO or MOJO in Spark, you should use the RowData class provided in h2o-genmodel.jar to build row-by-row input data and pass it to the easyPredict method to generate scores.
Your solution will be to read the parquet file from HDFS and then, for each row, fill a RowData object and pass it to your POJO scoring function. Remember that POJO and MOJO both use the exact same scoring function; the only difference is how the model is loaded, a compiled POJO class versus a MOJO resources zip package. Since MOJOs are backward compatible and work with any newer h2o-genmodel.jar, it is best to use a MOJO instead of a POJO.
The following is the full Scala code you can use in Spark to load a MOJO model and then do the scoring:
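(A reconstruction along the lines described above; the model path, feature names, and values are placeholders.)

```scala
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import hex.genmodel.easy.prediction.BinomialModelPrediction
import hex.genmodel.MojoModel

// Load the MOJO zip exported from R (the path is a placeholder)
val mojo = MojoModel.load("/path/to/model.zip")
val easyModel = new EasyPredictModelWrapper(mojo)

// Fill one RowData entry per model feature; names and values are illustrative
val row = new RowData
row.put("AGE", "68")
row.put("PSA", "3.1")

val prediction: BinomialModelPrediction = easyModel.predictBinomial(row)
println("Predicted label: " + prediction.label)
println("Class probabilities: " + prediction.classProbabilities.mkString(", "))
```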
Here is an example of reading parquet files in Spark and then saving them as CSV. You can use the same code to read the parquet from HDFS and then pass each row as RowData to the example above.
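A minimal sketch (the HDFS paths are placeholders):

```scala
// Read parquet from HDFS and write it back out as CSV
val df = spark.read.parquet("hdfs:///path/to/input.parquet")
df.write.option("header", "true").csv("hdfs:///path/to/output_csv")
```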
Finally, here is a detailed example of using a MOJO model in Spark to perform scoring with RowData.
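This sketch combines the two snippets above; it assumes every parquet column is a model feature (the easy-predict wrapper ignores RowData entries the model doesn't know about) and reuses easyModel from the earlier example:

```scala
val data = spark.read.parquet("hdfs:///path/to/input.parquet")
val featureNames = data.columns // capture just the column names, not the DataFrame

// Score row by row: fill a RowData per row and call the MOJO through easyModel
val scored = data.map { row =>
  val r = new RowData
  featureNames.foreach { name =>
    val v = row.getAs[Any](name)
    if (v != null) r.put(name, v.toString) // RowData values go in as Strings
  }
  val p = easyModel.predictBinomial(r)
  (p.label, p.classProbabilities(1))
}.toDF("label", "score")

scored.write.parquet("hdfs:///path/to/scored")
```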