Problem with Gradient Boosted Trees (GBT): I am running Spark on AWS EC2, version spark-1.4.1-bin-hadoop2.6.
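For reference, this is roughly how I invoke GBT (the dataset path and tree parameters below are placeholders, not my exact job):

```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

// Load the ~4GB training set (path and format are placeholders)
val data = MLUtils.loadLibSVMFile(sc, "s3n://my-bucket/training-data")

// Standard MLlib boosting setup: 40 iterations, default regression params
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 40
boostingStrategy.treeStrategy.maxDepth = 5

val model = GradientBoostedTrees.train(data, boostingStrategy)
```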
When I run GBT for 40 iterations, the input size shown in the Spark UI grows larger and larger for certain stages (and the runtime increases correspondingly):
- mapPartitions at DecisionTree.scala:613
- collect at DecisionTree.scala:977
- count at DecisionTreeMetadata.scala:111
I start with 4GB of input, and this eventually grows to over 100GB, increasing by a roughly constant amount per iteration. The related tasks complete slower and slower. Is this expected behavior, or is it a bug in MLlib?
My feeling is that somehow more and more data is being bound to the relevant data RDD on each iteration.
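To illustrate what I mean, here is a toy sketch (my own code, not MLlib's) of the pattern I suspect: an RDD that is redefined from itself in a loop accumulates lineage, so each action replays an ever-longer chain of transformations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageGrowth {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-growth"))
    var data = sc.parallelize(1 to 1000000).map(_.toDouble)
    for (i <- 1 to 40) {
      // Each iteration appends another transformation to the lineage.
      data = data.map(x => x * 0.99)
      // Each count() re-evaluates the whole (growing) chain from the start.
      println(s"iteration $i: count = ${data.count()}")
    }
    sc.stop()
  }
}
```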
Does anyone know how to fix it?
I think a problematic line might be L225 in GradientBoostedTrees.scala, where a new data RDD is defined on each iteration.
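If growing lineage is indeed the cause, the usual remedy would be to persist and periodically checkpoint the intermediate RDD to truncate its lineage. I can't do that from outside MLlib, since the loop lives inside GradientBoostedTrees, but here is a sketch of the idea applied to the toy example above (the checkpoint directory and interval are arbitrary choices of mine, and I haven't verified this fixes the GBT case):

```scala
sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")  // hypothetical path

var data = sc.parallelize(1 to 1000000).map(_.toDouble)
for (i <- 1 to 40) {
  data = data.map(x => x * 0.99)
  if (i % 10 == 0) {
    data.cache()        // keep the materialized partitions around
    data.checkpoint()   // truncate the lineage at this RDD
    data.count()        // force evaluation so the checkpoint is written
  }
}
```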
I am referring to https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/tree