I'm building a generic function which receives an RDD and runs some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example:
public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD<String> t1 = r... // some calculations
    JavaRDD<String> t2 = r... // other calculations
    return t1.union(t2);
}
My question is: since r is given to me, it may or may not already be cached. If it is already cached and I call cache on it again, will Spark create a new layer of cache, meaning that while t1 and t2 are calculated I will have two instances of r in the cache? Or is Spark aware that r is already cached, so it will ignore the second call?
I just tested this on my cluster; Zohar is right: nothing happens, the RDD is simply cached once. The reason, I think, is that every RDD has an internal id, and Spark uses that id to mark whether an RDD has already been cached, so caching one RDD multiple times does nothing.
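A minimal sketch of such a test (the local[*] master and the sample data are my assumptions, not the original cluster setup):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DoubleCacheTest {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "DoubleCacheTest");

        JavaRDD<String> r = sc.parallelize(Arrays.asList("a", "b", "c"));
        r.cache();
        r.cache(); // same RDD id, so the second call is a no-op

        // Run two actions, so the cached data is reused rather than recomputed.
        System.out.println(r.count());
        System.out.println(r.id() + " -> " + r.getStorageLevel());
        // The Storage tab of the Spark UI should show a single entry for this id.

        sc.stop();
    }
}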
Nothing. If you call cache on a cached RDD, nothing happens; the RDD will be cached (once). Caching, like many other transformations, is lazy:

- When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.
- When you call cache again, it is set to the same value (no change).
- Upon evaluation, when the RDD is materialized, Spark checks its storageLevel, and if it requires caching, caches it.

So you're safe.