Apache Spark MLlib algorithms (e.g., Decision Trees) save the model in a location (e.g., `myModelPath`), where they create two directories, viz. `myModelPath/data` and `myModelPath/metadata`. There are multiple files in these paths, and they are not text files. Some of the files are in `*.parquet` format.
I have a couple of questions:
- What is the format of these files?
- Which file or files contain the actual model?
- Can I save the model somewhere else, for example in a DB?
Spark >= 2.4
Since version 2.4, Spark provides format-agnostic writer interfaces, and selected models already implement them. For example, writing a `LinearRegressionModel` with the `pmml` format will create a directory with a single file containing the PMML representation.
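A minimal sketch of that export in PySpark, assuming `model` is a fitted model whose writer supports the `pmml` format (such as `pyspark.ml.regression.LinearRegressionModel`). The helper name `save_as_pmml` is hypothetical and only wraps the writer calls:

```python
def save_as_pmml(model, path):
    """Write `model` to `path` as PMML (assumes Spark >= 2.4).

    `model` is assumed to expose the format-agnostic writer, i.e.
    write() returns an object with format() and save().
    """
    # format("pmml") selects the PMML exporter instead of the default
    # Parquet + JSON layout; save() then writes the output directory.
    model.write().format("pmml").save(path)
```

With a fitted linear regression model bound to a variable such as `lrm`, the call would be `save_as_pmml(lrm, "/some/path")`, where the path is a placeholder. Models that do not implement this writer interface still need the default save layout.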
Spark < 2.4
- `data/*.parquet` files are in the Apache Parquet columnar storage format, and `metadata/part-*` looks like JSON.
- The actual model is in `model/*.parquet`.
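One way to check the JSON claim for the metadata files is to parse them with the standard library. This is a sketch under two assumptions: the model was saved to a local filesystem path, and each non-empty line of a `metadata/part-*` file holds one JSON object. The function name `read_model_metadata` is hypothetical.

```python
import glob
import json
import os


def read_model_metadata(model_path):
    """Return the JSON records found in <model_path>/metadata/part-*."""
    records = []
    pattern = os.path.join(model_path, "metadata", "part-*")
    # The pattern skips marker files such as _SUCCESS and hidden checksum files.
    for part in sorted(glob.glob(pattern)):
        with open(part, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    # Assumption: one JSON object per non-empty line.
                    records.append(json.loads(line))
    return records
```

Calling `read_model_metadata("myModelPath")` returns a list of dictionaries; if parsing raises `json.JSONDecodeError`, the one-object-per-line assumption does not hold for that model.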
- I am not aware of any direct method, but you can load the model as a data frame and store it in a database afterwards.
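The load-then-store approach could look roughly like the following PySpark sketch. It assumes an active `SparkSession` is passed in as `spark`, that the model data lives under `<model_path>/data` as in the question, and that a JDBC driver for the target database is on the classpath. The function name and all arguments are hypothetical.

```python
def store_model_data(spark, model_path, jdbc_url, table, properties):
    """Load <model_path>/data as a DataFrame and write it to a JDBC table.

    `jdbc_url`, `table` and `properties` (e.g. user, password, driver)
    are placeholders for the target database settings.
    """
    # The saved model data is plain Parquet, so the regular reader works.
    model_df = spark.read.parquet(model_path + "/data")
    # DataFrameWriter.jdbc creates the table and writes the rows.
    model_df.write.jdbc(url=jdbc_url, table=table, properties=properties)
    return model_df
```

Depending on the algorithm, the model data may contain nested columns that the JDBC writer cannot map directly, so some flattening or serialization may be needed before the write succeeds.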