I call the persist method on a Dataset and it usually works fine, but sometimes it never finishes. I'm using Cloudera with Spark 2.1. Has anyone experienced the same?
originalDataSet = originalDataSet.persist(StorageLevel.MEMORY_AND_DISK());
Just like almost everything else in Spark, the persist operation is lazily evaluated. The first time an action is taken on the underlying data and the relevant part of the DAG is executed, the persist comes into play. Therefore you shouldn't, and wouldn't, see any problem from the persist call itself.
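You can see this for yourself by forcing materialization with an action. A minimal sketch, assuming a Dataset<Row> named originalDataSet as in your snippet:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

// persist() only marks the Dataset for caching; nothing is computed here
originalDataSet = originalDataSet.persist(StorageLevel.MEMORY_AND_DISK());

// the cache is actually built the first time an action runs,
// so this count() is where any slowness would really show up
long rowCount = originalDataSet.count();

If something "never finishes", it is the first action after persist that is hanging, and that is where you should look (executor memory pressure, skewed partitions, etc.), not at the persist call.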
Data is persisted, in this case to memory and disk, but only up to the available capacity. If cached partitions need to be evicted because there is not enough room, Spark will do so.
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
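The docs mention RDD.unpersist(), and the Dataset API has the same method. A short sketch of manual cleanup, again assuming the originalDataSet from your question:

// release the cached blocks once they are no longer needed
originalDataSet.unpersist();      // asynchronous: returns immediately
originalDataSet.unpersist(true);  // blocking: waits until all blocks are removed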
See https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence