Saving DataFrame to Parquet takes a lot of time

Posted 2019-09-16 11:18

Question:

I have a Spark data frame with around 458 million rows. It was initially an RDD, so I converted it to a Spark data frame using sqlContext.createDataFrame.

The first few rows of the RDD are as follows:

sorted_rdd.take(5)
Out[25]:
[(353, 21, u'DLR_Where Dreams Come True Town Hall', 0, 0.896152913570404),
 (353, 2, u'DLR_Leading at a Higher Level', 1, 0.7186800241470337),
 (353,
  220,
  u'DLR_The Year of a Million Dreams Leadership Update',
  0,
  0.687175452709198),
 (353, 1, u'DLR_Challenging Conversations', 1, 0.6632049083709717),
 (353,
  0,
  u'DLR_10 Keys to Inspiring, Engaging, and Energizing Your People',
  1,
  0.647541344165802)]

I convert it into a data frame as below:

sorted_df = sqlContext.createDataFrame(sorted_rdd, ['user', 'itemId', 'itemName', 'Original', 'prediction'])

And finally I save it as below:

sorted_df.write.parquet("predictions_df.parquet") 

I am running Spark on YARN with 50 executors, each with 10g of memory and 5 cores (a sketch of this configuration is below). The write command has been running for an hour and the file has still not been saved.
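For reference, a minimal sketch of how resources like these might be configured when creating the context (the config keys are standard Spark properties; in practice they are usually supplied as spark-submit flags rather than in code, and the post itself only mentions sqlContext):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Sketch only: standard properties matching the description above
# (50 executors on YARN, 10g memory and 5 cores each).
conf = (SparkConf()
        .setMaster("yarn")
        .set("spark.executor.instances", "50")
        .set("spark.executor.memory", "10g")
        .set("spark.executor.cores", "5"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)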

What is making it so slow?

Answer 1:

Two things I can think of to try:

  1. You might want to check the number of partitions you have. If you have too few partitions, you don't get the parallelism you need (see the sketch after this list).

  2. Spark evaluates lazily. This means the write itself might be fast, but the computation needed to produce the data is slow. What you can try is caching the dataframe (and performing an action such as count on it to make sure it materializes) and then writing again. If the saving part is fast now, the problem is the computation and not the Parquet write (also sketched below).
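A minimal sketch of both suggestions, assuming the sorted_df from the question (getNumPartitions, repartition, cache and count are standard Spark APIs; the partition threshold of 200 is only an illustration):

# 1. Check the number of partitions; with ~458 million rows and
#    250 task slots (50 executors x 5 cores), a few hundred partitions
#    is reasonable. Repartition if there are only a handful.
num_parts = sorted_df.rdd.getNumPartitions()
if num_parts < 200:
    sorted_df = sorted_df.repartition(200)

# 2. Cache and force materialization so the upstream computation runs now.
sorted_df.cache()
sorted_df.count()   # action that triggers the actual computation

# If this write finishes quickly, the bottleneck was the computation,
# not the Parquet write itself.
sorted_df.write.parquet("predictions_df.parquet")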



Answer 2:

Also try to increase the number of cores if you have them available; this is one of the main factors, because the total number of cores (cores per executor times the number of executors) determines how many tasks can run in parallel.
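To make the arithmetic concrete: the number of tasks that can run at once is roughly executors times cores per executor, e.g. 50 x 5 = 250 in the setup above. A quick way to compare that against the dataframe's partitioning (sc is assumed to be the SparkContext behind sqlContext):

# Total parallel task slots vs. partitions to process.
slots = sc.defaultParallelism               # reflects the cluster's parallelism
parts = sorted_df.rdd.getNumPartitions()
print("task slots: %d, partitions: %d" % (slots, parts))
# If partitions are far fewer than slots, most cores sit idle during the write.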