Suppose we have a DataFrame df consisting of the following columns:
Name, Surname, Size, Width, Length, Weight
Now we want to perform a couple of operations; for example, we want to create a couple of DataFrames containing data about Size and Width:
```scala
val df1 = df.groupBy("surname").agg(sum("size"))
val df2 = df.groupBy("surname").agg(sum("width"))
```
As you can notice, other columns, like Length, are not used anywhere. Is Spark smart enough to drop the redundant columns before the shuffling phase, or are they carried around? Will running:
```scala
val dfBasic = df.select("surname", "size", "width")
```
before grouping somehow affect the performance?
Yes, it is "smart enough".
groupBy performed on a DataFrame is not the same operation as groupBy performed on a plain RDD. In the scenario you've described there is no need to move the raw data at all. Let's create a small example to illustrate that (see the sketch below). As you can see from the physical plan, the first phase is a projection where only the required columns are preserved. Next, the data is aggregated locally, and only then is it transferred and aggregated globally. You'll get slightly different output if you use Spark <= 1.4, but the general structure should be exactly the same.
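Here is a minimal sketch of such an example (the local SparkSession setup, toy rows, and column names below are my own assumptions, added only to make the snippet self-contained):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical local session and toy data, just to inspect the plan.
val spark = SparkSession.builder().master("local[*]").appName("groupby-pruning").getOrCreate()
import spark.implicits._

val df = Seq(
  ("John",  "Doe",   1, 10, 100, 1.0),
  ("Jane",  "Doe",   2, 20, 200, 2.0),
  ("Alice", "Smith", 3, 30, 300, 3.0)
).toDF("name", "surname", "size", "width", "length", "weight")

// Only `surname` and `size` are referenced, so the optimizer prunes the
// remaining columns before any shuffle; the plan shows a partial
// aggregation, an exchange, and a final aggregation.
df.groupBy("surname").agg(sum("size")).explain()
```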
Finally, a DAG visualization shows that the description above matches the actual job.
Similarly, Dataset.groupByKey followed by reduceGroups contains both a map-side (ObjectHashAggregate with partial_reduceaggregator) and a reduce-side (ObjectHashAggregate with reduceaggregator) reduction:
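The following is only a sketch (it reuses the hypothetical spark session and implicits from the snippet above, and the toy Dataset contents are made up):

```scala
// Typed aggregation: groupByKey + reduceGroups still performs a partial,
// map-side reduction before the shuffle, which shows up in the physical
// plan as two ObjectHashAggregate nodes (partial_reduceaggregator and
// reduceaggregator).
val ds = Seq(("a", 1), ("a", 3), ("b", 5), ("b", 1)).toDS()

ds.groupByKey(_._1)
  .reduceGroups((x, y) => (x._1, x._2 + y._2))
  .explain()
```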
However, other methods of KeyValueGroupedDataset might work similarly to RDD.groupByKey. For example, mapGroups (or flatMapGroups) doesn't use partial aggregation.
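A quick way to see the difference (again just a sketch, using the same hypothetical ds as above):

```scala
// mapGroups receives each key's values as a whole iterator on the reduce
// side, so no partial (map-side) aggregation appears in the plan; every
// row for a key is shuffled to the same task first.
ds.groupByKey(_._1)
  .mapGroups((key, values) => (key, values.map(_._2).sum))
  .explain()
```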