I know that some Spark actions like `collect()` can cause performance issues.
The documentation states:

> To print all elements on the driver, one can use the `collect()` method to first bring the RDD to the driver node thus: `rdd.collect().foreach(println)`. This can cause the driver to run out of memory, though, because `collect()` fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the `take()`: `rdd.take(100).foreach(println)`.
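To make the difference concrete, here is a minimal sketch of the two approaches, assuming an existing `SparkContext` named `sc` (e.g. from `spark-shell`):

```scala
// Sketch, assuming a SparkContext `sc` is already available.
val rdd = sc.parallelize(1 to 1000000)

// collect() materializes the ENTIRE RDD on the driver -- risky for large data:
// rdd.collect().foreach(println)

// take(n) fetches only the first n elements, so driver memory stays bounded:
rdd.take(100).foreach(println)
```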
And from a related SE question, Spark runs out of memory when grouping by key, I have come to know that `groupByKey()` and `reduceByKey()` may cause out-of-memory errors if parallelism is not set properly.
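For illustration, here is a hedged sketch of the two aggregation styles that question compares, assuming an existing `SparkContext` named `sc`:

```scala
// Sketch, assuming a SparkContext `sc` is already available.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey shuffles every value for a key to one executor before aggregating:
val sumsGrouped = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so far less data crosses the network:
val sumsReduced = pairs.reduceByKey(_ + _)
```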
I did not find enough evidence about other transformations and actions that have to be used with caution. Are these three the only commands to watch out for? I have doubts about the commands below too:

- `aggregateByKey()`
- `sortByKey()`
- `persist()` / `cache()`
It would be great if you could provide information on intensive commands (those that are global across partitions rather than local to a single partition, or otherwise low-performance commands) that have to be used with better guarding.
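For concreteness, this is the kind of usage I have in mind for `aggregateByKey()` (a sketch, assuming an existing `SparkContext` named `sc`):

```scala
// Sketch, assuming a SparkContext `sc` is already available.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// aggregateByKey takes a zero value, a within-partition combine function,
// and a cross-partition merge function. Here: per-key (sum, count).
val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: runs inside a partition
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merges partition results
)
```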
You have to consider three types of operations:

- Transformations that can be executed using `mapPartitions(WithIndex)`, like `filter`, `map`, `flatMap`, etc. Typically this will be the safest group. Probably the biggest issue you can encounter is extensive spilling to disk.
- Transformations that require shuffling: `combineByKey` (`groupByKey`, `reduceByKey`, `aggregateByKey`) or `join`, and less obvious ones like `sortBy`, `distinct` or `repartition`. Without a context (data distribution, exact function for reduction, partitioner, resources) it is hard to tell whether a particular transformation will be problematic. There are two main factors: network traffic and additional disk IO.
- Operations that require passing data to and from the driver. Typically this covers actions like `collect` or `take`, and creating a distributed data structure from a local one (`parallelize`). Other members of this category are broadcasts (including automatic broadcast joins) and accumulators. The total cost depends, of course, on the particular operation and the amount of data.

While some of these operations can be expensive, none is particularly bad by itself (including the demonized `groupByKey`). Obviously it is better to avoid network traffic or additional disk IO, but in practice you cannot avoid them in any complex application.

Regarding cache, you may find Spark: Why do I have to explicitly tell what to cache? useful.
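As a quick sketch of explicit caching (assuming an existing `SparkContext` named `sc`; the storage level shown is one choice among several):

```scala
// Sketch, assuming a SparkContext `sc` is already available.
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000).map(_ * 2)

rdd.cache()                                  // shorthand for persist(MEMORY_ONLY)
// rdd.persist(StorageLevel.MEMORY_AND_DISK) // alternative: spill to disk
                                             // instead of recomputing

val total = rdd.sum()   // first action computes the RDD and populates the cache
val count = rdd.count() // subsequent actions reuse the cached partitions

rdd.unpersist()         // release the cached blocks when no longer needed
```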