How to add a column to a DataFrame with the mean o

2019-07-19 06:42发布

问题:

Is there a better way?

val mean = df.select(avg("date")).first().getDouble(0)
df.withColumn("mean", lit(mean))

I assume that it could worth it to avoid calling an action …

回答1:

It is possible to avoid additional action using broadcast with cross product:

import org.apache.spark.sql.functions.broadcast

df.crossJoin(broadcast(df.agg(avg("date"))))

or:

spark.conf.set("spark.sql.crossJoin.enabled", true)

df.join(broadcast(df.agg(avg("date"))))

What you shouldn't do is using window functions:

df.withColumn("avg", avg("date").over())