I'm building a generic function which receives an RDD and runs some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example:
public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD<String> t1 = r... // some calculations
    JavaRDD<String> t2 = r... // other calculations
    return t1.union(t2);
}
My question is: since r is given to me, it may or may not already be cached. If it is already cached and I call cache on it again, will Spark create a new layer of cache, meaning that while t1 and t2 are calculated I will have two instances of r in the cache? Or is Spark aware that r is already cached, so it will ignore the second call?
I just tested this on my cluster; Zohar is right: nothing happens, the RDD is simply cached once. The reason, I think, is that every RDD has an internal id, and Spark uses that id to mark whether an RDD has already been cached, so caching one RDD multiple times does nothing.
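A minimal sketch of such a test (the local[*] master and the sample data are my assumptions, not the original cluster setup):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DoubleCacheTest {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "DoubleCacheTest");

        JavaRDD<String> r = sc.parallelize(Arrays.asList("a", "b", "c"));
        r.cache();
        r.cache(); // same RDD id, so the second call is a no-op

        // Run two actions, so the cached data is reused rather than recomputed.
        System.out.println(r.count());
        System.out.println(r.id() + " -> " + r.getStorageLevel());
        // The Storage tab of the Spark UI should show a single entry for this id.

        sc.stop();
    }
}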
Nothing. If you call cache on a cached RDD, nothing happens; the RDD will be cached (once). Caching, like many other transformations, is lazy:

- When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.
- When you call cache again, it is set to the same value (no change).
- Upon evaluation, when the RDD is materialized, Spark checks its storageLevel, and if it requires caching, caches it.

So you're safe.