Spark list all cached RDD names

2019-02-23 07:42发布

问题:

I am new to Apache Spark, I created several RDD's and DataFrames, cached them, now I want to unpersist some of them by using the command below

rddName.unpersist()

but I can't remember their names. I used sc.getPersistentRDDs but the output does not include the names. I also used the browser to view the cached rdds but again no name information. Am I missing something?

回答1:

@Dikei's answer is actually correct but I believe what you are looking for is sc.getPersistentRDDs :

scala> val rdd1 = sc.makeRDD(1 to 100)
# rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:27

scala> val rdd2 = sc.makeRDD(10 to 1000)
# rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:27

scala> rdd2.cache.setName("rdd_2")
# res0: rdd2.type = rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27

scala> sc.getPersistentRDDs
# res1: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27)

scala> rdd1.cache.setName("foo")
# res2: rdd1.type = foo ParallelCollectionRDD[0] at makeRDD at <console>:27

scala> sc.getPersistentRDDs
# res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)

Now let's add another RDD and name it as well :

scala> rdd3.setName("bar")
# res4: rdd3.type = bar ParallelCollectionRDD[2] at makeRDD at <console>:27

scala> sc.getPersistentRDDs
# res5: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)

We noticed that actually it isn't persisted.



回答2:

PySparkers: getPersistentRDDs isn't yet implemented in Python, so unpersist your RDDs by dipping into Java:

for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()


回答3:

There's no special meaning to the rrdName variable. It is just a reference to an RDD. For example, in the following code

val rrdName: RDD[Something]
val name2 = rrdName

name2 and rrdName are two references that point to the same RDD. Calling name2.unpersist is the same as calling rrdName.unpersist.

If you want to unpersist an RDD, you have to manually keep a reference to it.