Apache Spark MLlib Model File Format

Published 2019-01-25 17:47

Question:

Apache Spark MLlib algorithms (e.g., Decision Trees) save a model to a location (e.g., myModelPath) by creating two directories there: myModelPath/data and myModelPath/metadata. Each directory contains multiple files, and they are not text files. Some of them have the *.parquet extension.

I have a couple of questions:

  • What is the format of these files?
  • Which file or files contain the actual model?
  • Can I save the model to somewhere else, for example in a DB?

Answer 1:

Spark >= 2.4

Since version 2.4, Spark provides format-agnostic writer interfaces, and selected models already implement them. For example, with LinearRegressionModel:

val lrm: org.apache.spark.ml.regression.LinearRegressionModel = ???
val path: String = ???

lrm.write.format("pmml").save(path)

This creates a directory with a single file containing the PMML representation of the model.
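To make the PMML export concrete, here is a minimal end-to-end sketch. It assumes Spark 2.4 or later with spark-mllib on the classpath and a local master. The training data, the temporary output directory, and the names PmmlExportSketch and exportToPmml are made up for illustration.

```scala
import java.nio.file.Files

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object PmmlExportSketch {
  // Fits a toy model, writes it as PMML, and returns the output directory.
  def exportToPmml(spark: SparkSession): String = {
    import spark.implicits._

    // Made-up training data where label = 2 * x
    val training = Seq(
      (2.0, Vectors.dense(1.0)),
      (4.0, Vectors.dense(2.0)),
      (6.0, Vectors.dense(3.0))
    ).toDF("label", "features")

    val lrm = new LinearRegression().fit(training)

    // save() fails if the target already exists, so use a fresh subdirectory
    val path = Files.createTempDirectory("pmml-sketch").resolve("lrm").toString
    lrm.write.format("pmml").save(path)
    path
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pmml-export-sketch")
      .getOrCreate()

    val path = exportToPmml(spark)
    // The PMML document is XML text, so it can be read back as lines
    spark.read.textFile(path).collect().foreach(println)
    spark.stop()
  }
}
```

Only models whose writer supports the requested format can be exported this way; for other models the format call is expected to fail.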

Spark < 2.4

What is the format of these files?

  • data/*.parquet files are in the Apache Parquet columnar storage format
  • metadata/part-* files are plain text containing JSON
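Both kinds of files can be opened with Spark's own readers, which is an easy way to check the formats described above. The following is a sketch only: it assumes a local Spark session, trains a throwaway DecisionTreeClassifier on made-up data, and saves it to a temporary directory. The names InspectSavedModel and saveAndInspect are hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object InspectSavedModel {
  // Saves a toy model and returns (metadata, data) read back as DataFrames.
  def saveAndInspect(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._

    // Made-up two-class training set
    val training = Seq(
      (0.0, Vectors.dense(0.0, 1.0)),
      (1.0, Vectors.dense(1.0, 0.0)),
      (0.0, Vectors.dense(0.1, 0.9)),
      (1.0, Vectors.dense(0.9, 0.2))
    ).toDF("label", "features")

    val model = new DecisionTreeClassifier().fit(training)
    val modelPath = Files.createTempDirectory("dt-model").resolve("model").toString
    model.write.save(modelPath)

    // metadata/part-* is JSON text: one record with fields such as class and paramMap
    val metadata = spark.read.json(s"$modelPath/metadata")
    // data/*.parquet holds the fitted model itself (here, the tree nodes)
    val data = spark.read.parquet(s"$modelPath/data")
    (metadata, data)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inspect-saved-model")
      .getOrCreate()

    val (metadata, data) = saveAndInspect(spark)
    metadata.show(false)
    data.printSchema()
    data.show(false)
    spark.stop()
  }
}
```

The exact columns in the data directory depend on the model type and the Spark version, so printSchema is the safest way to see what a given model stores.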

Which file or files contain the actual model?

  • data/*.parquet

Can I save the model to somewhere else, for example in a DB?

I am not aware of any direct method, but you can load the model data as a DataFrame and store it in a database afterwards:

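// Read the Parquet files under the model's data directory as a DataFrame
val modelDf = spark.read.parquet("/path/to/data/")
// Write it out over JDBC, passing the URL, table name, and connection properties
modelDf.write.jdbc(...)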
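One practical wrinkle is that model data often contains nested or vector columns, which may not map to JDBC column types. A possible workaround, sketched below, is to collapse each row into a single JSON string before writing. This is an illustration under assumptions rather than an established recipe: the names ModelToJdbcSketch, toJsonRows, and store, and the column name row_json, are hypothetical, and a suitable JDBC driver must be on the classpath.

```scala
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ModelToJdbcSketch {
  // Collapses every row into one JSON string column, so nested or vector
  // columns do not have to be mapped to JDBC types.
  def toJsonRows(modelDf: DataFrame): DataFrame =
    modelDf.toJSON.toDF("row_json")

  // Reads the model's data directory and appends its rows to a JDBC table.
  // jdbcUrl, table, and props are supplied by the caller.
  def store(
      spark: SparkSession,
      dataPath: String,
      jdbcUrl: String,
      table: String,
      props: Properties): Unit = {
    val modelDf = spark.read.parquet(dataPath)
    toJsonRows(modelDf)
      .write
      .mode(SaveMode.Append)
      .jdbc(jdbcUrl, table, props)
  }
}
```

To load such a model again, the rows would have to be read back from the table, parsed with the original schema, and written out as Parquet next to a copy of the metadata directory. That reverse path is not shown here.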