What is the meaning of the number in brackets after an RDD?
Question:
Answer 1:
The number after RDD is its identifier:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val rdd = sc.range(0, 42)
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[1] at range at <console>:24
scala> rdd.id
res0: Int = 1
It is used to track the RDD across the session, for example for caching:
scala> rdd.cache
res1: rdd.type = MapPartitionsRDD[1] at range at <console>:24
scala> rdd.count
res2: Long = 42
scala> sc.getPersistentRDDs
res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> MapPartitionsRDD[1] at range at <console>:24)
This number is simply an incrementing integer (nextRddId is just an AtomicInteger):
private[spark] def newRddId(): Int = nextRddId.getAndIncrement()
generated when the RDD is constructed:
/** A unique ID for this RDD (within its SparkContext). */
val id: Int = sc.newRddId()
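The two snippets above can be mimicked without Spark. The following is only a minimal sketch of the mechanism: FakeContext and FakeRdd are hypothetical stand-ins invented for illustration, not Spark classes, and only the id counter is modelled.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for SparkContext: it only owns the id counter.
class FakeContext {
  private val nextRddId = new AtomicInteger(0)
  // Returns the current value, then increments it atomically.
  def newRddId(): Int = nextRddId.getAndIncrement()
}

// Hypothetical stand-in for RDD: the id is taken once, at construction time.
class FakeRdd(ctx: FakeContext, val name: String) {
  val id: Int = ctx.newRddId()
  // Same "Name[id]" shape the REPL prints for real RDDs.
  override def toString: String = s"$name[$id]"
}

object RddIdDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new FakeContext
    val first = new FakeRdd(ctx, "ParallelCollectionRDD")
    val second = new FakeRdd(ctx, "MapPartitionsRDD")
    println(first)  // ParallelCollectionRDD[0]
    println(second) // MapPartitionsRDD[1]
  }
}
```

Because the counter lives in the context, ids are unique only within one context, and every newly constructed object consumes the next value whether or not it is ever assigned to a variable.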
So if we continue with:
scala> val pairs1 = sc.parallelize(Seq((1, "foo")))
pairs1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> val pairs2 = sc.parallelize(Seq((1, "bar")))
pairs2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> pairs1.id
res5: Int = 2
scala> pairs2.id
res6: Int = 3
you'll see 2 and 3, and if you execute
scala> pairs1.join(pairs2).foreach(_ => ())
you'd expect 4, which can be confirmed by checking the UI.
We can also see that join creates a few new RDDs under the covers (5 and 6).
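The ids can also be inspected from the shell instead of the UI. The sketch below assumes the same spark-shell session, with pairs1 and pairs2 still defined; lineageIds is a hypothetical helper written for this example, not part of Spark. It relies only on the public id, dependencies and toDebugString members of RDD.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not a Spark API): walk the lineage and collect
// the id of this RDD followed by the ids of all of its ancestors.
def lineageIds(rdd: RDD[_]): Seq[Int] =
  rdd.id +: rdd.dependencies.flatMap(dep => lineageIds(dep.rdd))

// Calling join again builds new RDD objects, so it consumes fresh ids
// rather than reusing the ones from the earlier join.
val joined = pairs1.join(pairs2)

// Textual lineage; each line shows an RDD with its id in brackets.
println(joined.toDebugString)

// Ids of the joined RDD and everything it depends on.
println(lineageIds(joined).distinct.sorted)
```

The exact numbers printed depend on how many RDDs the session has already created, so they will generally differ from the ones shown earlier.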