I am trying to save a fitted model to a file in Spark. I have a Spark cluster that trains a RandomForest model, and I would like to save the fitted model and reuse it on another machine. I read some posts on the web that recommend Java serialization. I am doing the equivalent in Python, but it does not work. What is the trick?
from pyspark.mllib.tree import RandomForest
import pickle

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=nb_tree, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=depth)

# Try to persist the fitted model with the standard pickle module
with open('model.ml', 'wb') as output:
    pickle.dump(model, output)
I am getting this error:
TypeError: can't pickle lock objects
I am using Apache Spark 1.2.0.
If you look at the source code, you'll see that RandomForestModel inherits from TreeEnsembleModel, which in turn inherits from the JavaSaveable class that implements the save() method, so you can save your model as in the example below. It will save the model into file_path using the spark_context.
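A minimal sketch of that example, assuming sc is the SparkContext used for training, model is the fitted RandomForestModel from the question, and the HDFS path and the testData RDD are just placeholders:

model.save(sc, "hdfs:///models/random_forest")  # persist the model to the given path

# Later, possibly from another machine with access to the same storage:
from pyspark.mllib.tree import RandomForestModel
loaded_model = RandomForestModel.load(sc, "hdfs:///models/random_forest")
predictions = loaded_model.predict(testData.map(lambda p: p.features))

The loaded model behaves like the original one, so you can keep calling predict() on it from the new machine.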
You cannot (at least for now) use Python's native pickle to do that. If you really want to, you'll need to implement the __getstate__ and __setstate__ methods manually. See the pickle documentation for more information.
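Purely for illustration (this is generic Python, not specific to Spark's model classes), here is a toy sketch of how __getstate__ and __setstate__ let an object drop and recreate an unpicklable attribute such as a lock:

import pickle
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()  # locks cannot be pickled

    def __getstate__(self):
        # Copy the instance dict and drop the lock before pickling
        state = self.__dict__.copy()
        del state['_lock']
        return state

    def __setstate__(self, state):
        # Restore the picklable attributes and recreate the lock
        self.__dict__.update(state)
        self._lock = threading.Lock()

c = Counter()
c.value = 42
restored = pickle.loads(pickle.dumps(c))  # no "can't pickle lock objects" error
print(restored.value)  # 42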