Question:
As far as I know, there are two types of dependencies: narrow and wide. But I don't understand how the dependency affects the child RDD. Is the child RDD only metadata that describes how to build new RDD blocks from the parent RDD? Or is the child RDD a self-sufficient set of data created from the parent RDD?
Answer 1:
Yes, the child RDD is metadata that describes how to calculate the RDD from the parent RDD.
Consider org/apache/spark/rdd/MappedRDD.scala for example:
private[spark]
class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {

  // The child's partitions are simply the parent's partitions
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // f is applied lazily to the parent's iterator when a partition is computed
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).map(f)
}
When you say rdd2 = rdd1.map(...), rdd2 will be such a MappedRDD. compute is only executed later, for example when you call rdd2.collect.
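For example, a minimal sketch of that laziness (assuming sc is an existing SparkContext, e.g. in spark-shell):

val rdd1 = sc.parallelize(Seq(1, 2, 3, 4))
val rdd2 = rdd1.map(_ * 2)   // builds only the mapped RDD: metadata, no data is processed yet
val result = rdd2.collect()  // the action triggers compute on each partition
// result: Array(2, 4, 6, 8)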
An RDD is always such metadata, even if it has no parents (for example sc.textFile(...)). The only case where an RDD's data is stored on the nodes is if you mark it for caching with rdd.cache, and then cause it to be computed.
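A minimal sketch of caching (the input path is hypothetical; sc is assumed as above):

val lines = sc.textFile("data.txt")   // hypothetical input file
val upper = lines.map(_.toUpperCase)
upper.cache()    // only marks the RDD for caching; nothing is computed yet
upper.count()    // first action: partitions are computed and kept in memory
upper.count()    // second action: served from the cached blocks, no recomputation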
Another similar situation is calling rdd.checkpoint. This function marks the RDD for checkpointing. The next time it is computed, it will be written to disk, and later accesses to the RDD will read it from disk instead of recalculating it.
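A minimal sketch of checkpointing (the checkpoint directory is a hypothetical path; on a real cluster it should be reliable storage such as HDFS):

sc.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical directory
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()  // only marks the RDD; nothing is written yet
rdd.count()       // once computed, the RDD's data is also written to the checkpoint directory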
The difference between cache and checkpoint is that a cached RDD still retains its dependencies. The cached data can be discarded under memory pressure and may need to be recalculated, in part or in whole. This cannot happen with a checkpointed RDD, so its dependencies are discarded.
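One way to observe this difference is toDebugString, which prints an RDD's lineage (a sketch, reusing the assumptions above):

val cached = sc.parallelize(1 to 100).map(_ + 1)
cached.cache()
cached.count()
println(cached.toDebugString)        // still lists the parent RDDs

sc.setCheckpointDir("/tmp/spark-checkpoints")   // hypothetical directory
val checkpointed = sc.parallelize(1 to 100).map(_ + 1)
checkpointed.checkpoint()
checkpointed.count()
println(checkpointed.toDebugString)  // lineage is truncated at the checkpoint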