In particular, if I say
rdd3 = rdd1.join(rdd2)
then when I call rdd3.collect, then depending on the Partitioner used, either data is moved between nodes' partitions, or the join is done locally on each partition (or, for all I know, something else entirely).
This depends on what the RDD paper calls "narrow" and "wide" dependencies, but who knows how good the optimizer is in practice.
Anyway, I can sort of glean from the trace output which of these actually happened, but it would be nice to be able to call rdd3.explain.
Does such a thing exist?
I think toDebugString will appease your curiosity. Each indentation is a stage, so this should run as two stages.
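For example, a minimal sketch (assuming a SparkContext named sc, e.g. in spark-shell; the exact lineage strings vary by Spark version):

```scala
// Sketch: inspect the lineage of a join with toDebugString.
// Assumes a SparkContext `sc` is already available (e.g. spark-shell).
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")))
val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y")))
val rdd3 = rdd1.join(rdd2)

// Prints the RDD lineage; each level of indentation marks a stage
// boundary (i.e. a shuffle), so a plain join shows two stages.
println(rdd3.toDebugString)
```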
Also, the optimizer is fairly decent; however, I would suggest using DataFrames if you are on 1.3+, as the optimizer there is even better in many cases. :)

I would use the Spark UI (the web page the Spark context serves) instead of
toDebugString
whenever I can. It is much easier to comprehend, and gives a bit more information (with fewer glitches, in my admittedly limited experience). Also, the Spark UI shows the number of Tasks and their input and output sizes for each Stage, which helps in figuring out what it does.

Besides, there's very little information shown in either of them. Mostly just a graph of boxes saying
MapPartitionsRDD [12]
and such, which doesn't tell much about what that step actually does. (For WholeStageCodegen boxes, the DEBUG log under org.apache.spark.sql.execution at least contains the generated code. But no ID of any kind is logged to pair that output with what you see in the Spark UI.)
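For completeness: the explain the question wished for does exist on DataFrames/Datasets. A sketch, assuming a SparkSession named spark (e.g. in spark-shell) with its implicits in scope:

```scala
// Sketch: DataFrames have the explain() the question asks for.
// Assumes a SparkSession `spark` is already available.
import spark.implicits._

val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val df2 = Seq((1, "x"), (2, "y")).toDF("id", "v2")

// explain(true) prints the parsed, analyzed, and optimized logical
// plans plus the physical plan; WholeStageCodegen operators are
// marked with a `*` in recent Spark versions.
df1.join(df2, "id").explain(true)
```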