Problem with Gradient Boosted Trees (GBT): I am running Spark on AWS EC2, version spark-1.4.1-bin-hadoop2.6.
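For reference, this is roughly how I invoke GBT (the dataset path and tree parameters below are placeholders, not my exact job):

```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

// Load the ~4GB training set (path and format are placeholders)
val data = MLUtils.loadLibSVMFile(sc, "s3n://my-bucket/training-data")

// Standard MLlib boosting setup: 40 iterations, default regression params
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 40
boostingStrategy.treeStrategy.maxDepth = 5

val model = GradientBoostedTrees.train(data, boostingStrategy)
```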
When I run GBT for 40 iterations, the input size shown in the Spark UI grows larger and larger for certain stages (and the runtime increases correspondingly):
- mapPartitions at DecisionTree.scala:613
- collect at DecisionTree.scala:977
- count at DecisionTreeMetadata.scala:111
I start with 4GB of input, and this eventually grows to over 100GB, increasing by a roughly constant amount per iteration. The related tasks complete slower and slower. Is this expected behavior, or is it a bug in MLlib?
My feeling is that somehow more and more data is being bound to the relevant data RDD on each iteration.
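To illustrate what I mean, here is a toy sketch (my own code, not MLlib's) of the pattern I suspect: an RDD that is redefined from itself in a loop accumulates lineage, so each action replays an ever-longer chain of transformations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageGrowth {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-growth"))
    var data = sc.parallelize(1 to 1000000).map(_.toDouble)
    for (i <- 1 to 40) {
      // Each iteration appends another transformation to the lineage.
      data = data.map(x => x * 0.99)
      // Each count() re-evaluates the whole (growing) chain from the start.
      println(s"iteration $i: count = ${data.count()}")
    }
    sc.stop()
  }
}
```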
Does anyone know how to fix it?
I think a problematic line might be L225 in GradientBoostedTrees.scala, where a new data RDD is defined on each iteration.
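If growing lineage is indeed the cause, the usual remedy would be to persist and periodically checkpoint the intermediate RDD to truncate its lineage. I can't do that from outside MLlib, since the loop lives inside GradientBoostedTrees, but here is a sketch of the idea applied to the toy example above (the checkpoint directory and interval are arbitrary choices of mine, and I haven't verified this fixes the GBT case):

```scala
sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")  // hypothetical path

var data = sc.parallelize(1 to 1000000).map(_.toDouble)
for (i <- 1 to 40) {
  data = data.map(x => x * 0.99)
  if (i % 10 == 0) {
    data.cache()        // keep the materialized partitions around
    data.checkpoint()   // truncate the lineage at this RDD
    data.count()        // force evaluation so the checkpoint is written
  }
}
```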
I am referring to https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/tree