In terms of RDD persistence, what are the differences between `cache()` and `persist()` in Spark?
With `cache()`, you use only the default storage level `MEMORY_ONLY`. With `persist()`, you can specify whichever storage level you want (see the RDD Persistence section of the programming guide). From the official docs:
Use `persist()` if you want to assign a storage level other than `MEMORY_ONLY` to the `RDD` (see "Which Storage Level to Choose?" in the programming guide).

Caching and persistence are optimization techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results, as `RDD`s, are kept in memory (the default) or in more solid storage such as disk, and/or replicated. `RDD`s can be cached using the `cache` operation; they can also be persisted using the `persist` operation.

Warning: cache judiciously (see "(Why) do we need to call cache or persist on an RDD"):
Just because you can cache an `RDD` in memory doesn't mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in recomputing it, recomputation can be faster than the price paid by the increased memory pressure. It should go without saying that if you only read a dataset once, there is no point in caching it; it will actually make your job slower. The size of cached datasets can be seen in the Spark Shell.
Listing the variants:
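The `cache`/`persist` variants defined on `RDD` (signatures from `RDD.scala`; bodies elided) are:

```scala
def cache(): this.type                          // same as persist()
def persist(): this.type                        // default level: MEMORY_ONLY
def persist(newLevel: StorageLevel): this.type  // caller-chosen storage level
```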
Note: due to the very small and purely syntactic difference between caching and persistence of `RDD`s, the two terms are often used interchangeably.
Persist in memory and disk:
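A minimal sketch of persisting to memory and disk (assuming an existing `SparkContext` named `sc`; the input path is hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical input path
lines.persist(StorageLevel.MEMORY_AND_DISK) // keep in memory; spill partitions that don't fit to disk
lines.count() // the first action actually materializes the persisted data
```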
Cache
Caching can improve the performance of your application to a great extent.
There is no difference. From `RDD.scala`.
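The relevant definitions (lightly abridged from Spark's `RDD.scala`) show that `cache()` simply delegates to the default `persist()`:

```scala
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
```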
Spark gives five commonly used storage levels:
- `MEMORY_ONLY` (the default): store deserialized Java objects in the JVM heap; partitions that don't fit are recomputed when needed
- `MEMORY_ONLY_SER`: store serialized objects; more space-efficient, but more CPU-intensive to read
- `MEMORY_AND_DISK`: keep deserialized objects in memory; spill partitions that don't fit to disk
- `MEMORY_AND_DISK_SER`: like `MEMORY_ONLY_SER`, but spill to disk instead of recomputing
- `DISK_ONLY`: store the partitions only on disk
`cache()` will use `MEMORY_ONLY`. If you want to use something else, call `persist(StorageLevel.<type>)`.

By default, `persist()` will store the data in the JVM heap as deserialized objects.
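If heap usage is a concern, a serialized level trades CPU for memory. A hedged sketch (again assuming a live `SparkContext` `sc` and a hypothetical input path):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/input.txt") // hypothetical input path
rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized in memory: smaller footprint, slower to access
```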