What happens if I cache the same RDD twice in Spark?

Posted 2020-03-03 05:26

I'm building a generic function which receives an RDD and does some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example:

public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD<String> t1 = r... // Some calculations
    JavaRDD<String> t2 = r... // Other calculations
    return t1.union(t2);
}

My question is: since r is given to me, it may or may not already be cached. If it is cached and I call cache on it again, will Spark create a new layer of cache, meaning that while t1 and t2 are calculated I will have two instances of r in the cache? Or is Spark aware that r is already cached, so that the second call is ignored?

2 answers
别忘想泡老子
#2 · 2020-03-03 06:08

I just tested this on my cluster, and Zohar is right: nothing happens, and the RDD is cached only once. The reason, I think, is that every RDD has an internal id, and Spark uses that id to track whether the RDD has been cached. So caching the same RDD multiple times does nothing.

Below are my code and screenshots:

[Two screenshots, not reproduced here]

Update: added the code, as requested.


### Cache and count; the storage info then shows up in the web UI

raw_file = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)\
                 .setName("raw_file")\
                 .cache()
raw_file.count()

### Cache and count again, then check the web UI: nothing changes

raw_file.cache()
raw_file.count()

### Change the RDD's name, then cache and count again to see whether Spark caches a new RDD
### under the new name. Still nothing changes, so it is probably using the RDD id as the marker.
### To be sure, we would need to read the documentation or even the source code.

raw_file.setName("raw_file_2")
raw_file.cache().count()
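
The same thing can be checked from code rather than from the web UI. The helper below is only a sketch: cache_if_needed is a hypothetical name, not part of Spark, and it relies on the is_cached flag that PySpark sets on the Python side when cache() or persist() is called. Repeated cache() calls are harmless, so the guard is not needed for safety; it is only useful if a generic function wants to know whether it was the one that requested caching.

```python
def cache_if_needed(rdd):
    """Request caching unless the RDD is already marked as cached.

    Hypothetical helper (not a Spark API). Returns True if this call
    requested caching, False if the RDD was already marked as cached.
    """
    if rdd.is_cached:   # flag set by PySpark's cache()/persist()
        return False
    rdd.cache()
    return True

# Usage sketch (assumes an existing SparkContext named sc):
#   raw_file = sc.parallelize(["a", "b"])
#   cache_if_needed(raw_file)   # caching requested by this call
#   cache_if_needed(raw_file)   # already marked as cached, nothing to do
```

A function like foo in the question could use the return value to decide whether it should call unpersist() when it is done, leaving a caller's own caching untouched.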
何必那么认真
#3 · 2020-03-03 06:19

Nothing. If you call cache on an RDD that is already cached, nothing happens; the RDD is cached only once. Caching, like transformations, is lazy:

  • When you call cache, the RDD's storageLevel is set to MEMORY_ONLY
  • When you call cache again, it's set to the same value (no change)
  • Upon evaluation, when the underlying RDD is materialized, Spark checks the RDD's storageLevel and caches the RDD if that level requires it.

So you're safe.
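
The three steps above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's implementation: ToyRDD, storage_level, and compute_calls are all made-up names, and a real RDD is partitioned and distributed.

```python
# Toy model only: hypothetical names, NOT Spark's actual code.
MEMORY_ONLY = "MEMORY_ONLY"
NONE = "NONE"

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self.storage_level = NONE    # no caching requested yet
        self._cached = None          # materialized data, once stored
        self.compute_calls = 0       # how often the data was computed

    def cache(self):
        # Lazy: only records the requested level. A second call
        # sets the same value again, so nothing changes.
        self.storage_level = MEMORY_ONLY
        return self

    def collect(self):
        # An "action": caching actually happens here, on evaluation.
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        if self.storage_level != NONE:
            self._cached = data      # exactly one copy is kept
        return data

rdd = ToyRDD(lambda: [x * 2 for x in range(3)])
rdd.cache()
rdd.cache()              # second call: same storage level, no new layer
first = rdd.collect()    # computes the data and stores one copy
second = rdd.collect()   # served from that single cached copy
print(rdd.storage_level, rdd.compute_calls, first is second)
# prints: MEMORY_ONLY 1 True
```

In this model there is no place where a second cached copy could come from: cache() only sets a field, and the data is stored once when the first action runs.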
