As far as I know, there are two types of dependencies: narrow and wide. But I don't understand how a dependency affects the child RDD. Is the child RDD only metadata that describes how to build new RDD blocks from the parent RDD? Or is the child RDD a self-sufficient set of data that was created from the parent RDD?
Yes, the child RDD is metadata that describes how to calculate the RDD from the parent RDD.
Consider `org/apache/spark/rdd/MappedRDD.scala` for example.
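In Spark 1.x that class looked roughly like this (a simplified sketch of Spark's own source; newer Spark versions use `MapPartitionsRDD` instead, so treat the details as illustrative):

```scala
// Simplified from Spark 1.x sources; this class lives inside Spark itself.
package org.apache.spark.rdd

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}

private[spark] class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {  // the only state: a parent reference and a function

  // Pure metadata: the child reuses the parent's partitioning.
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // No data exists until a task runs; only then is the parent's iterator
  // pulled and `f` applied element by element.
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[T].iterator(split, context).map(f)
}
```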
When you say `rdd2 = rdd1.map(...)`, `rdd2` will be such a `MappedRDD`. Its `compute` method is only executed later, for example when you call `rdd2.collect`.
An RDD is always such metadata, even if it has no parents (for example `sc.textFile(...)`). The only case where an RDD's data is stored on the nodes is if you mark it for caching with `rdd.cache` and then cause it to be computed.
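A minimal sketch of that lifecycle (`sc` is assumed to be a `SparkContext`, and the HDFS path is a placeholder; `cache`, `collect`, and `count` are standard RDD API calls):

```scala
val rdd1 = sc.textFile("hdfs:///some/path")  // metadata only: nothing is read yet
val rdd2 = rdd1.map(_.toUpperCase)           // still metadata: a two-step lineage

rdd2.cache()    // marks rdd2 for caching; nothing is computed yet
rdd2.collect()  // triggers computation; rdd2's partitions are now kept in memory
rdd2.count()    // served from the cache instead of rereading the file
```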
Another similar situation is calling `rdd.checkpoint`. This function marks the RDD for checkpointing. The next time it is computed, it will be written to disk, and later access to the RDD will cause it to be read from disk instead of recalculated.
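For example (a hedged sketch; `setCheckpointDir` and `checkpoint` are real `SparkContext`/`RDD` methods, and the paths are placeholders):

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // required before checkpointing

val rdd = sc.textFile("hdfs:///some/path").map(_.length)
rdd.checkpoint()  // marks the RDD; must be called before the first job on it
rdd.count()       // first computation also writes the partitions to the checkpoint dir
rdd.count()       // read back from disk instead of recomputed from the lineage
```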
The difference between `cache` and `checkpoint` is that a cached RDD still retains its dependencies. The cached data can be discarded under memory pressure, and may need to be recalculated in part or in whole. This cannot happen with a checkpointed RDD, so the dependencies are discarded there.
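You can observe this difference in the lineage itself via `toDebugString` (a sketch, continuing with the checkpoint directory configured above):

```scala
val cached = sc.parallelize(1 to 100).map(_ * 2)
cached.cache()
cached.count()
println(cached.toDebugString)  // full lineage back to parallelize is still listed

val checkpointed = sc.parallelize(1 to 100).map(_ * 2)
checkpointed.checkpoint()      // called before the first action on this RDD
checkpointed.count()           // materializes and writes the checkpoint
println(checkpointed.toDebugString)  // lineage now starts from the checkpoint data on disk
```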