As far as I know, there are two types of dependencies: narrow and wide. But I don't understand how a dependency affects the child RDD. Is the child RDD only metadata that describes how to build new RDD blocks from the parent RDD? Or is the child RDD a self-sufficient set of data that was created from the parent RDD?
Yes, the child RDD is metadata that describes how to calculate the RDD from the parent RDD.
Consider `org/apache/spark/rdd/MappedRDD.scala` for example.
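In Spark 1.x that class looked roughly like this (a simplified sketch of Spark's own source; newer Spark versions use `MapPartitionsRDD` instead, so treat the details as illustrative):

```scala
// Simplified from Spark 1.x sources; this class lives inside Spark itself.
package org.apache.spark.rdd

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}

private[spark] class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {  // the only state: a parent reference and a function

  // Pure metadata: the child reuses the parent's partitioning.
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // No data exists until a task runs; only then is the parent's iterator
  // pulled and `f` applied element by element.
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[T].iterator(split, context).map(f)
}
```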
When you say `rdd2 = rdd1.map(...)`, `rdd2` will be such a `MappedRDD`. Its `compute` method is only executed later, for example when you call `rdd2.collect`.
An RDD is always such metadata, even if it has no parents (for example `sc.textFile(...)`). The only case where an RDD's data is stored on the nodes is if you mark it for caching with `rdd.cache` and then cause it to be computed.
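A minimal sketch of that lifecycle (`sc` is assumed to be a `SparkContext`, and the HDFS path is a placeholder; `cache`, `collect`, and `count` are standard RDD API calls):

```scala
val rdd1 = sc.textFile("hdfs:///some/path")  // metadata only: nothing is read yet
val rdd2 = rdd1.map(_.toUpperCase)           // still metadata: a two-step lineage

rdd2.cache()    // marks rdd2 for caching; nothing is computed yet
rdd2.collect()  // triggers computation; rdd2's partitions are now kept in memory
rdd2.count()    // served from the cache instead of rereading the file
```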
Another similar situation is calling `rdd.checkpoint`. This function marks the RDD for checkpointing. The next time it is computed, it will be written to disk, and later access to the RDD will cause it to be read from disk instead of recalculated.
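For example (a hedged sketch; `setCheckpointDir` and `checkpoint` are real `SparkContext`/`RDD` methods, and the paths are placeholders):

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // required before checkpointing

val rdd = sc.textFile("hdfs:///some/path").map(_.length)
rdd.checkpoint()  // marks the RDD; must be called before the first job on it
rdd.count()       // first computation also writes the partitions to the checkpoint dir
rdd.count()       // read back from disk instead of recomputed from the lineage
```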
The difference between `cache` and `checkpoint` is that a cached RDD still retains its dependencies. The cached data can be discarded under memory pressure, and may need to be recalculated in part or in whole. This cannot happen with a checkpointed RDD, so the dependencies are discarded there.
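You can observe this difference in the lineage itself via `toDebugString` (a sketch, continuing with the checkpoint directory configured above):

```scala
val cached = sc.parallelize(1 to 100).map(_ * 2)
cached.cache()
cached.count()
println(cached.toDebugString)  // full lineage back to parallelize is still listed

val checkpointed = sc.parallelize(1 to 100).map(_ * 2)
checkpointed.checkpoint()      // called before the first action on this RDD
checkpointed.count()           // materializes and writes the checkpoint
println(checkpointed.toDebugString)  // lineage now starts from the checkpoint data on disk
```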