Spark foreachPartition runs only on master

Posted 2019-07-26 07:04

I have a Dataproc cluster with one master and 4 workers, and this Spark job:

JavaRDD<Signal> rdd_data = javaSparkContext.parallelize(my_data, 8);

rdd_data.foreachPartition(partitionOfRecords -> {
    println("Items in partition-" + partitionOfRecords.count(y=>true));
});

Where my_data is an array of about 1000 elements. The job starts on the cluster correctly and returns the correct data, but it runs only on the master and not on the workers. I use Dataproc image 1.4 for every machine in the cluster.

Can anybody help me understand why this job runs only on the master?

2 Answers
在下西门庆 · 2019-07-26 07:35

I found master set to local[1] in the extra Spark config! Now it works correctly!
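
As a quick sanity check (not part of the original answer, just a sketch), you can print the master URL that the running job actually picked up; on a correctly configured Dataproc cluster this should be yarn rather than local[1]. Here sc is assumed to be the JavaSparkContext used to create rdd_data:

// Prints the effective master URL from inside the driver program.
System.out.println("spark.master = " + sc.getConf().get("spark.master", "not set"));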

一纸荒年 Trace。 · 2019-07-26 07:37

There are two points of interest here:

  1. The line println("Items in partition-" + partitionOfRecords.count(y=>true)); will print the expected results only when the executor is the same node as the client running the Spark program. This is because println writes to the stdout stream under the hood, which is accessible only on that same machine, so messages from other nodes cannot be propagated back to the client program; they end up in each executor's own stdout log instead. A sketch of how to see the counts on the driver is shown after this list.
  2. When you set master to local[1], you force Spark to run locally with a single thread, so Spark and the client program share the same stdout stream and you can see the program's output. It also means that the driver and the executor run on the same node.
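
If you want to see the per-partition counts on the driver even when the job runs on the workers (for example under yarn on Dataproc), one option is to compute the counts on the executors and collect them back instead of printing there. This is only a minimal sketch, not code from the original post; it assumes a List<Integer> named myData stands in for the original my_data array:

import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCounts {
    public static void main(String[] args) {
        // Let the cluster manager (yarn on Dataproc) supply the master; do not hardcode local[1].
        SparkConf conf = new SparkConf().setAppName("partition-counts");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical stand-in for the original my_data array of ~1000 elements.
            List<Integer> myData = IntStream.range(0, 1000).boxed().collect(Collectors.toList());
            JavaRDD<Integer> rdd = sc.parallelize(myData, 8);

            // Count the items of each partition on the executors, then ship the counts to the driver.
            List<Integer> counts = rdd.mapPartitions(it -> {
                int n = 0;
                while (it.hasNext()) { it.next(); n++; }
                return Collections.singletonList(n).iterator();
            }).collect();

            for (int i = 0; i < counts.size(); i++) {
                System.out.println("Items in partition " + i + ": " + counts.get(i));
            }
        }
    }
}

With foreachPartition, the println still runs on a cluster, but its output lands in each executor's stdout log (visible through the YARN or Spark UI) rather than in the client's console.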