I have a Dataproc cluster with one master and 4 workers, and this Spark job:
JavaRDD<Signal> rdd_data = javaSparkContext.parallelize(my_data, 8);
rdd_data.foreachPartition(partitionOfRecords -> {
    long count = 0;
    while (partitionOfRecords.hasNext()) { partitionOfRecords.next(); count++; }
    System.out.println("Items in partition-" + count);
});
where my_data is a list of about 1000 elements. The job starts correctly on the cluster and returns the right data, but it runs only on the master and not on the workers. I use Dataproc image 1.4 for every machine in the cluster.
Can anybody help me understand why this job runs only on the master?
I found master local[1] in the extra Spark config! Now it works correctly!
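For context, a minimal sketch of what that misconfiguration amounts to in code (the app name here is made up; everything else follows standard Spark conventions):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// master local[1] forces the whole job into a single thread of a single JVM
// on the submitting machine, so the workers are never used.
SparkConf local = new SparkConf().setAppName("my-app").setMaster("local[1]");

// On Dataproc, leave the master unset (spark-submit supplies --master yarn),
// or set it to yarn explicitly so executors are scheduled on the worker nodes.
SparkConf onYarn = new SparkConf().setAppName("my-app").setMaster("yarn");

JavaSparkContext javaSparkContext = new JavaSparkContext(onYarn);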
There are two points of interest here:
System.out.println("Items in partition-" + count);
will print the expected results only when the executor runs on the same node as the client running the Spark program. This happens because println writes to the stdout stream of the JVM that executes it, and an executor's stdout is accessible only on that machine (it ends up in the executor's local log), so messages from other nodes cannot be propagated back to the client program.
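If you want to see the per-partition counts on the driver no matter where the partitions are processed, one option (a sketch, assuming the Spark 2.x Java API that ships with Dataproc image 1.4) is to compute each count inside mapPartitions and collect the results, instead of printing on the executors:

import java.util.Collections;
import java.util.List;

// Each partition emits exactly one element: the number of records it holds.
List<Long> counts = rdd_data.mapPartitions(it -> {
    long n = 0;
    while (it.hasNext()) { it.next(); n++; }
    return Collections.singletonList(n).iterator();
}).collect();

// collect() brings the counts back to the driver, so this println is visible.
for (int i = 0; i < counts.size(); i++) {
    System.out.println("Items in partition-" + i + ": " + counts.get(i));
}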