Spark 1.4 MLlib: memory piles up with gradient boosted trees

Posted 2019-07-07 00:17

Question:

Problem with Gradient Boosted Trees (GBT): I am running spark-1.4.1-bin-hadoop2.6 on AWS EC2.

When I run GBT for 40 iterations, the input size shown in the Spark UI grows larger and larger for certain stages (and the runtime increases correspondingly):

  • MapPartition in DecisionTree.scala L613
  • Collect in DecisionTree.scala L977
  • count in DecisionTreeMetadata.scala L111

I start with 4 GB of input, and this eventually grows to over 100 GB, increasing by a constant amount. The related tasks take longer and longer to complete. Is this expected behavior, or is it a bug in MLlib?

My feeling is that more and more data somehow gets attached to the relevant data RDD.

Does anyone know how to fix it?

I think the problematic line might be L225 in GradientBoostedTrees.scala, where a new data RDD is defined.

I am referring to https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/tree
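
To make the suspicion above concrete, here is a toy model in plain Python of how input can grow by a constant amount per iteration when each iteration defines a new dataset on top of the previous one. This is not Spark code and not a confirmed diagnosis of GradientBoostedTrees.scala: the `LazyDataset` class, its methods, and `run_iterations` are hypothetical names made up for illustration, and "rows read" is only a rough stand-in for the input size shown in the Spark UI.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark API)."""

    def __init__(self, source_rows=None, parent=None, fn=None):
        self.source_rows = source_rows  # only set on the root dataset
        self.parent = parent            # dataset this one is derived from
        self.fn = fn                    # per-row transformation
        self.cached = None              # materialized rows, if any

    def map(self, fn):
        # Defines a new dataset on top of this one; nothing is computed yet.
        return LazyDataset(parent=self, fn=fn)

    def _compute(self):
        # Returns (rows, rows read while walking the chain of parents).
        if self.cached is not None:
            return self.cached, len(self.cached)
        if self.parent is None:
            return list(self.source_rows), len(self.source_rows)
        parent_rows, rows_read = self.parent._compute()
        return [self.fn(r) for r in parent_rows], rows_read + len(parent_rows)

    def materialize(self):
        # Compute once and keep the rows, so later reads stop at this dataset.
        self.cached, rows_read = self._compute()
        return rows_read

    def count_with_cost(self):
        rows, rows_read = self._compute()
        return len(rows), rows_read


def run_iterations(n_rows, n_iters, materialize_each):
    data = LazyDataset(source_rows=range(n_rows))
    rows_read_per_iter = []
    for _ in range(n_iters):
        data = data.map(lambda r: r + 1)  # a new dataset is defined every iteration
        rows_read = 0
        if materialize_each:
            rows_read += data.materialize()
        _, action_cost = data.count_with_cost()  # one action per iteration
        rows_read_per_iter.append(rows_read + action_cost)
    return rows_read_per_iter


growing = run_iterations(4, 5, materialize_each=False)
flat = run_iterations(4, 5, materialize_each=True)
print(growing)
print(flat)
```

In this model the rows read per iteration increase by a fixed step when nothing is materialized, and stay constant when each new dataset is materialized before it is used. In Spark the analogous tools would be `RDD.persist` and `RDD.checkpoint`, but whether the growth I see actually comes from an unmaterialized chain of RDDs is only my assumption.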