I have a Spark DataFrame with around 458 million rows. It started out as an RDD, which I then converted to a DataFrame using sqlContext.createDataFrame.
The first few rows of the RDD look like this:
sorted_rdd.take(5)
Out[25]:
[(353, 21, u'DLR_Where Dreams Come True Town Hall', 0, 0.896152913570404),
(353, 2, u'DLR_Leading at a Higher Level', 1, 0.7186800241470337),
(353,
220,
u'DLR_The Year of a Million Dreams Leadership Update',
0,
0.687175452709198),
(353, 1, u'DLR_Challenging Conversations', 1, 0.6632049083709717),
(353,
0,
u'DLR_10 Keys to Inspiring, Engaging, and Energizing Your People',
1,
0.647541344165802)]
I convert it into a DataFrame as below:
sorted_df = sqlContext.createDataFrame(sorted_rdd, ['user', 'itemId', 'itemName', 'Original', 'prediction'])
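Since I only pass column names, I believe Spark has to sample the RDD to infer the column types. An alternative I could use is an explicit schema; a minimal sketch (the exact types are my guess from the sample rows above):

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DoubleType)

# Explicit schema, so Spark does not need to infer types from the RDD
schema = StructType([
    StructField('user', IntegerType(), True),
    StructField('itemId', IntegerType(), True),
    StructField('itemName', StringType(), True),
    StructField('Original', IntegerType(), True),
    StructField('prediction', DoubleType(), True),
])

sorted_df = sqlContext.createDataFrame(sorted_rdd, schema)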
And finally I save it as below:
sorted_df.write.parquet("predictions_df.parquet")
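For what it is worth, I understand that each partition of the DataFrame should become one parquet part file, so I can check how many partitions the write will produce; a quick sketch (the repartition value of 200 is just an illustrative number, not what I actually use):

# number of part files the write would produce
sorted_df.rdd.getNumPartitions()

# e.g. rebalance into a chosen number of partitions before writing
sorted_df.repartition(200).write.parquet("predictions_df.parquet")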
I am running Spark on YARN with 50 executors, each with 10g of memory and 5 cores. The write command has been running for an hour and the file has still not been saved.
What is making it this slow?
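For reference, this is roughly how the context is configured (a sketch on my side; the app name, deploy mode, and setup code are assumptions, only the executor numbers match my actual job):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# 50 executors x 10g memory x 5 cores, as described above
conf = (SparkConf()
        .setAppName('predictions')
        .setMaster('yarn-client')
        .set('spark.executor.instances', '50')
        .set('spark.executor.memory', '10g')
        .set('spark.executor.cores', '5'))

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)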