I'm trying to chain multiple operations in a single statement in PySpark, and I'm not sure that's possible in my case.
My intention is to avoid having to save intermediate output as a new dataframe.
My current code is rather simple:
from pyspark.sql.functions import udf, col, mean, stddev
from pyspark.sql.types import StringType

# encode_time maps a raw START_TIME value to a time-period label
encodeUDF = udf(encode_time, StringType())

(new_log_df.cache()
    .withColumn('timePeriod', encodeUDF(col('START_TIME')))
    .groupBy('timePeriod')
    .agg(
        mean('DOWNSTREAM_SIZE').alias("Mean"),
        stddev('DOWNSTREAM_SIZE').alias("Stddev")
    )
    .show(20, False))
My intention is to add count() after the groupBy, to get the count of records matching each value of the timePeriod column, printed/shown in the output.
When I try groupBy(..).count().agg(..), I get exceptions.
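For reference, here is a minimal sketch of the failing attempt (using the same columns as above); my guess is that count() returns a plain DataFrame containing only the timePeriod and count columns, so the subsequent agg(..) can no longer resolve DOWNSTREAM_SIZE:

(new_log_df.withColumn('timePeriod', encodeUDF(col('START_TIME')))
    .groupBy('timePeriod')
    .count()  # yields a DataFrame with columns: timePeriod, count
    .agg(mean('DOWNSTREAM_SIZE').alias("Mean"))  # DOWNSTREAM_SIZE no longer exists here
    .show())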
Is there any way to get both the count() and the agg().show() output, without splitting the code into two separate commands, e.g.:
new_log_df.withColumn(..).groupBy(..).count()
new_log_df.withColumn(..).groupBy(..).agg(..).show()
Or, better yet, a merged agg.show() output: an extra column that states the count of records matching each row's value, e.g.:
timePeriod | Mean | Stddev | Num Of Records
X          | 10   | 20     | 315
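To make the goal concrete, here is a sketch of the kind of single statement I'm hoping for; I'm assuming pyspark.sql.functions.count can be mixed with the other aggregates inside agg(), but I haven't verified this:

from pyspark.sql.functions import count

(new_log_df.cache()
    .withColumn('timePeriod', encodeUDF(col('START_TIME')))
    .groupBy('timePeriod')
    .agg(
        mean('DOWNSTREAM_SIZE').alias("Mean"),
        stddev('DOWNSTREAM_SIZE').alias("Stddev"),
        count('DOWNSTREAM_SIZE').alias("Num Of Records")  # per-group record count
    )
    .show(20, False))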