I call the persist method on a Dataset and it usually works fine, but sometimes it never finishes. I'm using Cloudera with Spark 2.1. Has anyone experienced the same?
originalDataSet = originalDataSet.persist(StorageLevel.MEMORY_AND_DISK());
Just like almost everything else in Spark, the persist operation is lazily evaluated. The first time an action is taken on the underlying data and the relevant part of the DAG is executed, the persist comes into play. Therefore you shouldn't, and wouldn't, see any problem from the persist call itself.
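You can see this for yourself by forcing materialization with an action. A minimal sketch, assuming a Dataset<Row> named originalDataSet as in your snippet:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

// persist() only marks the Dataset for caching; nothing is computed here
originalDataSet = originalDataSet.persist(StorageLevel.MEMORY_AND_DISK());

// the cache is actually built the first time an action runs,
// so this count() is where any slowness would really show up
long rowCount = originalDataSet.count();

If something "never finishes", it is the first action after persist that is hanging, and that is where you should look (executor memory pressure, skewed partitions, etc.), not at the persist call.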
Data is persisted, in this case to memory and disk, but only up to the available capacity. If cached partitions need to be evicted because there is not enough room, Spark will do so.
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
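The docs mention RDD.unpersist(), and the Dataset API has the same method. A short sketch of manual cleanup, again assuming the originalDataSet from your question:

// release the cached blocks once they are no longer needed
originalDataSet.unpersist();      // asynchronous: returns immediately
originalDataSet.unpersist(true);  // blocking: waits until all blocks are removed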
See https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence