What is the result of RDD transformation in Spark?

Can anyone explain what the result of an RDD transformation is? Is it a new set of data (a copy of the data), or only a new set of pointers to filtered blocks of the old data?

Tags: apache-spark rdd

5 Answers

我想做一个坏孩纸

Reply #2 · 2019-04-09 17:05

RDD transformations let you create dependencies between RDDs. Dependencies are only the steps for producing results (a program). Each RDD in the lineage chain (the string of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD. Spark divides the RDD dependencies into stages and tasks and sends those to workers for execution.

So if you do this:

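val lines = sc.textFile("...")                       // lazy: only records where to read from
val words = lines.flatMap(line => line.split(" "))  // lazy: records the split step, with lines as parent
val localwords = words.collect()                    // action: runs the job, returns Array[String]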

words will be an RDD containing a reference to the lines RDD. When the program is executed, the function of lines runs first (load the data from a text file), then the function of words runs on the resulting data (split the lines into words). Spark is lazy, so nothing is executed until you call an action that triggers job creation and execution (collect in this example).

So an RDD (a transformed RDD, too) is not 'a set of data' but a step in a program (possibly the only step) telling Spark how to get the data and what to do with it.
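
A minimal sketch of the laziness described above, assuming a local Spark session. The object name LazyDemo, the in-memory sample lines (standing in for the text file), and the accumulator are illustrative choices, not part of the original example. The accumulator counts how often the flatMap function actually runs, before and after the action.

```scala
import org.apache.spark.sql.SparkSession

object LazyDemo {
  // Returns (calls before the action, calls after the action, collected words).
  def run(): (Long, Long, Seq[String]) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()
    val sc = spark.sparkContext
    try {
      // Stand-in for sc.textFile("..."): a small in-memory RDD of lines.
      val lines = sc.parallelize(Seq("a b", "c d e"))

      // Counts how many times the flatMap function is invoked.
      val calls = sc.longAccumulator("flatMap calls")

      val words = lines.flatMap { line =>
        calls.add(1)
        line.split(" ")
      }

      // Only the lineage has been recorded so far; no line has been split.
      val before = calls.sum

      // collect() is an action: it triggers the job and returns the data to the driver.
      val localWords = words.collect().toSeq

      // One call per input line, barring task retries.
      val after = calls.sum
      (before, after, localWords)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (before, after, localWords) = run()
    println(s"calls before action: $before")
    println(s"calls after action:  $after")
    println(localWords.mkString(", "))
  }
}
```

Until collect() runs, words holds only the recipe (its parent plus the split function), which is why the counter stays at zero.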

Melony?

Reply #3 · 2019-04-09 17:10

Transformations create a new RDD based on an existing RDD. RDDs are immutable, and all transformations in Spark are lazy: the data in an RDD is not processed until an action is performed.

Examples of RDD transformations: map, filter, flatMap, groupByKey, reduceByKey
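
To make the immutability point concrete, here is a small sketch, assuming a local Spark session. The object name ImmutableDemo and the sample numbers are made up for illustration. Each transformation returns a different RDD object and leaves its parent as it was.

```scala
import org.apache.spark.sql.SparkSession

object ImmutableDemo {
  // Returns (contents of the original RDD, contents of the derived RDD,
  //          whether filter returned the same object as its parent).
  def run(): (Seq[Int], Seq[Int], Boolean) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[*]").appName("immutable-demo").getOrCreate()
    val sc = spark.sparkContext
    try {
      val nums = sc.parallelize(1 to 5)
      val evens = nums.filter(_ % 2 == 0) // new RDD; nums is not modified
      val doubled = evens.map(_ * 2)      // another new RDD built on evens

      // Both collect() calls are actions; each runs its own job from the lineage.
      (nums.collect().toSeq, doubled.collect().toSeq, evens eq nums)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (original, derived, sameObject) = run()
    println(s"original: ${original.mkString(", ")}")
    println(s"derived:  ${derived.mkString(", ")}")
    println(s"filter returned the same RDD object: $sameObject")
  }
}
```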

走好不送

Reply #4 · 2019-04-09 17:22

As others have mentioned, an RDD maintains a record (its lineage) of all the transformations that have been programmatically applied to it. These are lazily evaluated, so although you may get back an RDD with a different type parameter (in the REPL, after applying a map, for example), the 'new' RDD doesn't yet contain anything, because nothing has forced the transformations and filters in its lineage to be evaluated. Actions such as count and the various reduction methods cause the transformations to be applied. The checkpoint method marks an RDD to be saved to the checkpoint directory; once the next action has materialized it, the RDD holds the result of the transformations but no longer references its lineage (this can be a performance advantage, especially in iterative applications).
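
A rough sketch of the checkpoint behaviour, assuming a local Spark session and a temporary local directory as the checkpoint location (on a cluster this would normally be reliable storage such as HDFS). The object name CheckpointDemo and the sample data are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CheckpointDemo {
  // Returns (isCheckpointed before the action, isCheckpointed after it, element count).
  def run(): (Boolean, Boolean, Long) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[*]").appName("checkpoint-demo").getOrCreate()
    val sc = spark.sparkContext
    try {
      // Temporary directory used only for this sketch.
      sc.setCheckpointDir(Files.createTempDirectory("rdd-checkpoint").toString)

      val evenSquares = sc.parallelize(1 to 100).map(x => x * x).filter(_ % 2 == 0)

      // checkpoint() only marks the RDD; it returns Unit and computes nothing.
      evenSquares.checkpoint()
      val before = evenSquares.isCheckpointed

      // The first action computes the RDD and then writes the checkpoint.
      val n = evenSquares.count()
      val after = evenSquares.isCheckpointed

      // The debug string now starts from the checkpointed data rather than the original chain.
      println(evenSquares.toDebugString)
      (before, after, n)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (before, after, n) = run()
    println(s"checkpointed before action: $before, after action: $after, count: $n")
  }
}
```

The Spark documentation recommends persisting an RDD before checkpointing it, since writing the checkpoint otherwise recomputes the RDD from its lineage.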

ゆ、 Hurt°

Reply #5 · 2019-04-09 17:25

All answers are perfectly valid. I just want to add a quick picture :-) (image not included)

放荡不羁爱自由

Reply #6 · 2019-04-09 17:29

Transformations are operations that turn one RDD into another. When you apply one to an RDD, you get a new RDD describing the transformed data (remember, RDDs in Spark are immutable). Operations like map, filter, and flatMap are transformations.

One point to note: applying a transformation to an RDD does not perform the operation immediately. Spark instead extends a DAG (Directed Acyclic Graph) that records the applied operation, the source RDD, and the function used for the transformation. It keeps building this graph through these references until you apply an action to the last RDD in the chain. That is why transformations in Spark are lazy.
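
The graph can be inspected directly. The sketch below assumes a local Spark session, and the object name LineageDemo and the word-count data are illustrative. It uses the public dependencies and toDebugString members of RDD to look at the references between RDDs before running an action.

```scala
import org.apache.spark.ShuffleDependency
import org.apache.spark.sql.SparkSession

object LineageDemo {
  // Returns (words points back at lines, counts depends on a shuffle, final word counts).
  def run(): (Boolean, Boolean, Map[String, Int]) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[*]").appName("lineage-demo").getOrCreate()
    val sc = spark.sparkContext
    try {
      val lines = sc.parallelize(Seq("a b a", "b a"))
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(w => (w, 1))
      val counts = pairs.reduceByKey(_ + _)

      // No job has run yet; these calls only inspect the recorded graph.
      val parentIsLines = words.dependencies.head.rdd eq lines
      val viaShuffle = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]

      // Prints the chain of RDDs from counts back to the source.
      println(counts.toDebugString)

      // collect() is the action that finally executes the graph.
      (parentIsLines, viaShuffle, counts.collect().toMap)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (parentIsLines, viaShuffle, result) = run()
    println(s"words -> lines reference: $parentIsLines, shuffle before counts: $viaShuffle")
    println(result)
  }
}
```

The shuffle dependency introduced by reduceByKey is where Spark would split this graph into two stages.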
