From the Spark source code:
/**
 * Represents the content of the Dataset as an `RDD` of `T`.
 *
 * @group basic
 * @since 1.6.0
 */
lazy val rdd: RDD[T] = {
  val objectType = exprEnc.deserializer.dataType
  rddQueryExecution.toRdd.mapPartitions { rows =>
    rows.map(_.get(0, objectType).asInstanceOf[T])
  }
}
The mapPartitions can take as long as the computation of the RDD in the first place, so it makes operations such as df.rdd.getNumPartitions very expensive. Given that a DataFrame is a Dataset[Row], and a Dataset is composed of RDDs, why is this re-mapping required? Any insights appreciated.
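For reference, a minimal sketch of the kind of call I mean (spark is an existing SparkSession and the path is just a placeholder):

// Sketch only: `spark` is an existing SparkSession, the path is a placeholder
val df = spark.read.parquet("/path/to/large/table")

// Accessing `rdd` evaluates the lazy val quoted above (query planning plus
// construction of the decoding RDD), even though only a partition count is wanted
val numPartitions = df.rdd.getNumPartitions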
TL;DR That's because the internal RDD is not an RDD[Row].

That's a huge oversimplification. First of all, Dataset[T] doesn't mean that you interact with a container of T. It means that if you use the collection-like API (often referred to as strongly typed), the internal representation will be decoded into T.

The internal representation is a binary format used internally by Tungsten. This representation is internal, subject to change, and far too low level to be used in practice.
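To make that concrete, a small sketch (assuming an existing SparkSession named spark; the data is made up for illustration):

import spark.implicits._                       // assumes a SparkSession `spark`
import org.apache.spark.sql.functions.upper
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

val ds = Seq("a", "b", "c").toDS()             // Dataset[String]

// Strongly typed (collection-like) API: each element is decoded into T = String
ds.map(_.toUpperCase).show()

// Untyped/SQL API: operates on the internal representation, no decoding into String
ds.select(upper($"value")).show()

// The internal representation itself: an RDD[InternalRow] (usually UnsafeRow instances)
val internal: RDD[InternalRow] = ds.queryExecution.toRdd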
An intermediate representation which exposes this data is the InternalRow - rddQueryExecution.toRdd is in fact an RDD[InternalRow]. This representation (there are different implementations) still exposes the internal types, is considered "weakly" private, like all objects in o.a.s.sql.catalyst (access is not explicitly restricted, but the API is not documented), and is rather tricky to interact with. This is where decoding comes into play, and why you need the full "re-mapping" - to convert the internal, often unsafe, objects into external types intended for public usage.
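For example (still assuming a SparkSession named spark; this is only a sketch, and the catalyst classes are not a stable public API):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

val df = spark.range(3).toDF("id")

// Public path: internal rows are decoded into external types (Row, Long, ...)
val external: Row = df.rdd.first()
external.getLong(0)

// Internal path: InternalRow (typically UnsafeRow) backed by Tungsten's binary format
val internal: InternalRow = df.queryExecution.toRdd.first()
internal.getLong(0)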
Finally, to reiterate my previous statement - the code in question won't be executed when getNumPartitions is called: mapPartitions is a lazy transformation, and getNumPartitions only inspects the partitions of the RDD lineage.
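A sketch of the difference, assuming the same spark session:

val df = spark.range(0, 1000000).toDF("id")

// Evaluating the `rdd` lazy val triggers query planning and adds the decode step
// to the lineage, but mapPartitions is lazy - the decoding closure never runs here
df.rdd.getNumPartitions

// The same count can be read from the internal RDD, leaving the decode step out entirely
df.queryExecution.toRdd.getNumPartitions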