2楼-- · 2019-01-12 16:43

With cache(), you use only the default storage level MEMORY_ONLY. With persist(), you can specify which storage level you want,(rdd-persistence).

From the official docs:

You can mark an RDD to be persisted using the persist() or cache() methods on it.

each persisted RDD can be stored using a different storage level

The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

Use persist() if you want to assign a storage level other than MEMORY_ONLY to the RDD (which storage level to choose)

0人赞添加讨论(0) 举报

冷血范

3楼-- · 2019-01-12 16:45

Caching or persistence are optimization techniques for (iterative and interactive) Spark computations. They help saving interim partial results so they can be reused in subsequent stages. These interim results as RDDs are thus kept in memory (default) or more solid storage like disk and/or replicated. RDDs can be cached using cache operation. They can also be persisted using persist operation.

persist, cache

These functions can be used to adjust the storage level of a RDD. When freeing up memory, Spark will use the storage level identifier to decide which partitions should be kept. The parameter less variants persist() and cache() are just abbreviations for persist(StorageLevel.MEMORY_ONLY).

Warning: Once the storage level has been changed, it cannot be changed again!

Warning -Cache judiciously... see ((Why) do we need to call cache or persist on a RDD)

Just because you can cache a RDD in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in doing so, recomputation can be faster than the price paid by the increased memory pressure.

It should go without saying that if you only read a dataset once there is no point in caching it, it will actually make your job slower. The size of cached datasets can be seen from the Spark Shell..

Listing Variants...

def cache(): RDD[T]
 def persist(): RDD[T]
 def persist(newLevel: StorageLevel): RDD[T]

*See below example : *

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
     c.getStorageLevel
     res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
     c.cache
     c.getStorageLevel
     res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)

The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY

Note : Due to the very small and purely syntactic difference between caching and persistence of RDDs the two terms are often used interchangeably.

See more visually here....

Persist in memory and disk:

Cache

Caching can improve the performance of your application to a great extent.

0人赞添加讨论(0) 举报

Viruses.

4楼-- · 2019-01-12 16:53

There is no difference. From RDD.scala.

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()

0人赞添加讨论(0) 举报

虎瘦雄心在

5楼-- · 2019-01-12 16:58

Spark gives 5 types of Storage level

MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY

cache() will use MEMORY_ONLY. If you want to use something else, use persist(StorageLevel.<*type*>).

By default persist() will store the data in the JVM heap as unserialized objects.

0人赞添加讨论(0) 举报

What is the difference between cache and persist?

`persist`, `cache`

Warning -Cache judiciously... see ((Why) do we need to call cache or persist on a RDD)

Cache

What is the difference between cache and persist?

persist, cache

Warning -Cache judiciously... see ((Why) do we need to call cache or persist on a RDD)

Cache

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间

`persist`, `cache`