Spark task duration difference

Posted 2019-07-09 04:53

I'm running an application that loads data (.csv) from S3 into DataFrames and then registers those DataFrames as temp tables. After that, I use Spark SQL to join those tables and finally write the result into a database. My current bottleneck is that the tasks do not seem to be evenly split, so I get no benefit from parallelization or from having multiple nodes in the cluster. More precisely, this is the distribution of task durations in the problematic stage: [image: task duration distribution]. Is there a way for me to enforce a more balanced distribution? Maybe by manually writing map/reduce functions? Unfortunately, this stage has 6 more tasks that are still running (1.7 hours at the moment), which will make the deviation even greater.

Tags: apache-spark scheduled-tasks apache-spark-sql

1 answer

我命由我不由天

#2 · 2019-07-09 05:35

There are two likely possibilities: one is under your control and, unfortunately, the other likely is not.

1. Skewed data. Check that the partitions are of roughly similar size, say within a factor of three or four of each other.
2. Inherent variability of Spark task runtime. I have seen long, unexplained delays in straggler tasks on Spark Standalone, YARN, and Mesos. The symptoms are:
   - extended periods (minutes) with little or no CPU or disk activity on the nodes hosting the straggler tasks
   - no apparent correlation between data size and the stragglers
   - different nodes/workers may experience the delays on subsequent runs of the same job
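
For the first possibility (skewed data), one way to check partition sizes is to count the rows in each partition and compare the largest against the smallest. The following is only a rough sketch, assuming PySpark: the helper names `partition_row_counts` and `skew_report` are hypothetical, and the default threshold of 4 simply mirrors the "factor of three or four" rule of thumb above.

```python
def partition_row_counts(df):
    """Return the number of rows in each partition of a PySpark DataFrame.

    Assumes `df` is a pyspark.sql.DataFrame; calling this runs a Spark job.
    """
    # Count rows by iterating each partition, without holding it in memory.
    return df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()


def skew_report(sizes, max_ratio=4.0):
    """Summarise partition sizes and flag skew.

    `sizes` is a list of per-partition row counts. Partitions are treated
    as skewed when largest / smallest non-empty exceeds `max_ratio`
    (an assumed threshold, not a Spark setting).
    """
    non_empty = [s for s in sizes if s > 0]
    if not non_empty:
        return {"partitions": len(sizes), "empty": len(sizes),
                "min": 0, "max": 0, "ratio": None, "skewed": False}
    smallest, largest = min(non_empty), max(non_empty)
    ratio = largest / smallest
    return {
        "partitions": len(sizes),
        "empty": len(sizes) - len(non_empty),  # empty partitions reported separately
        "min": smallest,
        "max": largest,
        "ratio": ratio,
        "skewed": ratio > max_ratio,
    }
```

With a DataFrame in hand (called `joined_df` here purely for illustration), `skew_report(partition_row_counts(joined_df))` gives a quick summary. Since the slow stage in the question is a join, counting rows per join key with `groupBy(...).count()` on each input table may also show whether a few keys dominate.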

One thing to check: run hdfs dfsadmin -report and hdfs fsck to see whether HDFS is healthy.
