Spark SQL DataFrame/Dataset execution engine has several extremely efficient time & space optimizations (e.g. InternalRow & expression codeGen). According to many documentations, it seems to be a better option than RDD for most distributed algorithms.
However, I did some sourcecode research and am still not convinced. I have no doubt that InternalRow is much more compact and can save large amount of memory. But execution of algorithms may not be any faster saving predefined expressions. Namely, it is indicated in sourcecode of org.apache.spark.sql.catalyst.expressions.ScalaUDF
, that every user defined function does 3 things:
- convert catalyst type (used in InternalRow) to scala type (used in GenericRow).
- apply the function
- convert the result back from scala type to catalyst type
Apparently this is even slower than just applying the function directly on RDD without any conversion. Can anyone confirm or deny my speculation by some real-case profiling and code analysis?
Thank you so much for any suggestion or insight.