I know that when we want to initialize some resource for a group of RDDs instead of individual RDD elements we should ideally use the mapPartition and foreachPartition. For example in case of initializing a JDBC connection for each partition of data. But are there scenarios where we should not use either of them and instead use plain vanilla map() and foreach() transformation and action.
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
When you write Spark jobs that uses either mapPartition or foreachPartition you can just modify the partition data itself or just iterate through partition data respectively. The anonymous function passed as parameter will be executed on the executors thus there is not a viable way to execute a code which invokes all the nodes e.g: df.reduceByKey from one particular executor. This code should be executed only from the driver node. Thus only from the driver code you can access dataframes, datasets and spark session.
Please find here a detailed discussion over this issue and possible solutions