I train a Random Forest with PySpark. I want a CSV with the results for each point in the grid. My code is:
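    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    estimator = RandomForestRegressor()
    evaluator = RegressionEvaluator()
    # Grid of hyperparameter combinations to evaluate
    paramGrid = ParamGridBuilder().addGrid(estimator.numTrees, [2,3])\
        .addGrid(estimator.maxDepth, [2,3])\
        .addGrid(estimator.impurity, ['variance'])\
        .addGrid(estimator.featureSubsetStrategy, ['sqrt'])\
        .build()
    pipeline = Pipeline(stages=[estimator])
    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=paramGrid,
                              evaluator=evaluator,
                              numFolds=3)
    # `result` is the training DataFrame (not shown here)
    cvModel = crossval.fit(result)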
So I want a CSV like this:
    numTrees | maxDepth | impurityMeasure
    2        | 2        | 0.001
    2        | 3        | 0.00023
    etc.
What is the best way to do this?
You'll have to combine different bits of data:
- Estimator ParamMaps, extracted using the getEstimatorParamMaps method.
- The corresponding metrics, stored in the avgMetrics parameter.
First, get the names and values of all parameters declared in each map.
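A minimal sketch of this first step. The helper name `param_maps_to_dicts` is made up for illustration; it only assumes that each map is a dict whose keys expose a `.name` attribute, which is how PySpark `Param` objects behave.

```python
def param_maps_to_dicts(param_maps):
    """Turn a list of {Param: value} maps into a list of {name: value} dicts.

    `param_maps` is expected to look like the output of
    cvModel.getEstimatorParamMaps(): one dict per grid point,
    keyed by objects that have a `.name` attribute.
    """
    return [
        {param.name: value for param, value in param_map.items()}
        for param_map in param_maps
    ]


# Intended usage with the fitted model from the question:
# params = param_maps_to_dicts(cvModel.getEstimatorParamMaps())
```

Each resulting dict holds one grid point, for example numTrees, maxDepth, impurity and featureSubsetStrategy with the values used for that run.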
Then zip them with the metrics and convert the result to a data frame.
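A rough sketch of the zip step, under a few assumptions: `params` is a list of name-to-value dicts (one per grid point), `avg_metrics` is in the same order as the param maps (as `cvModel.avgMetrics` is), and the column name `"rmse"` is the default metric of `RegressionEvaluator`. The helper names are hypothetical. To keep the example dependency-free it writes the CSV with the standard `csv` module rather than going through a data frame.

```python
import csv


def rows_with_metrics(params, avg_metrics, metric_name="rmse"):
    """Pair each parameter dict with its cross-validated metric."""
    return [
        {**ps, metric_name: metric}
        for ps, metric in zip(params, avg_metrics)
    ]


def write_rows_csv(rows, path):
    """Write a non-empty list of dicts to `path`, one row per grid point."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Intended usage with the objects from the question:
# rows = rows_with_metrics(params, cvModel.avgMetrics, evaluator.getMetricName())
# write_rows_csv(rows, "grid_results.csv")
```

If a data frame is preferred, the same `rows` list can be passed to `pandas.DataFrame(rows)` and saved with `to_csv("grid_results.csv", index=False)`.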