Spark RDD - is partition(s) always in RAM?

Published 2019-01-16 21:15

Question:

We all know Spark does its computation in memory. I am just curious about the following.

  1. If I create 10 RDDs in my pySpark shell from HDFS, does that mean the data of all 10 RDDs will reside in the Spark workers' memory?

  2. If I do not delete an RDD, will it stay in memory forever?

  3. If my dataset (file) size exceeds the available RAM, where will the data be stored?

Answer 1:

If I create 10 RDDs in my pySpark shell from HDFS, does that mean the data of all 10 RDDs will reside in Spark memory?

Yes, the data of all 10 RDDs will be spread across the Spark workers' RAM, but not every machine necessarily holds a partition of every RDD. Of course, an RDD will have data in memory only after an action is performed on it, since RDDs are lazily evaluated.

If I do not delete an RDD, will it be in memory forever?

Spark automatically unpersists RDDs or DataFrames that are no longer used. To check whether an RDD or DataFrame is cached, open the Spark UI -> Storage tab and inspect the memory details. You can use df.unpersist() or sqlContext.uncacheTable("sparktable") to remove a DataFrame or table from memory.

If my dataset size exceeds the available RAM, where will the data be stored?

If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed.

If we are saying an RDD is already in RAM, meaning it is in memory, what is the need to persist()? -- as per a comment

To answer your question: when an action is triggered on an RDD and that action cannot find enough memory, Spark can evict uncached/unpersisted RDDs.

In general, we persist RDDs that need a lot of computation and/or shuffling (by default Spark persists shuffled RDDs to avoid costly network I/O), so that when an action is performed on a persisted RDD, only that action runs rather than everything being recomputed from the start of the lineage graph. See the RDD persistence levels in the Spark documentation for details.



Answer 2:

If I create 10 RDDs in my pySpark shell, does that mean the data of all 10 RDDs will reside in Spark memory?

Answer: An RDD only contains the lineage graph (the applied transformations), so an RDD is not data! Whenever we perform an action on an RDD, all the transformations are applied before the action. So unless it is explicitly cached (of course there are some optimisations that cache implicitly), each time an action is performed the whole chain of transformations plus the action is executed again.

E.g. if you create an RDD from HDFS, apply some transformations, and perform 2 actions on the transformed RDD, the HDFS read and the transformations will be executed twice!

So, if you want to avoid the re-computation, you have to persist the RDD. For persisting you can choose a combination of one or more of on-heap memory, off-heap memory, and disk.

If I do not delete an RDD, will it be in memory forever?

Answer: Considering that an RDD is just a lineage graph, it follows the same scope and lifetime rules as any object in the hosting language. But if you have already persisted the computed result, you can unpersist it!

If my dataset size exceeds the available RAM, where will the data be stored?

Answer: Assuming you have actually persisted/cached the RDD in memory, it will be stored in memory, and LRU is used to evict data. See the Spark documentation for more information on how memory management is done in Spark.