Recently we migrated from "EMR on HDFS" --> "EMR on S3" (EMRFS with consistent view enabled) and we realized the Spark 'SaveAsTable' (parquet format) writes to S3 were ~4x slower as compared to HDFS but we found a workaround of using the DirectParquetOutputCommitter -[1] w/ Spark 1.6.
Reason for S3 slowness - We had to pay the so called Parquet tax-[2] where the default output committer writes to a temporary table and renames it later where the rename operation in S3 is very expensive
Also we do understand the risk of using 'DirectParquetOutputCommitter' which is possibility of data corruption w/ speculative tasks enabled.
Now w/ Spark 2.0 this class has been deprecated and we're wondering what options do we have on the table so that we don't get to bear the ~4x slower writes when we upgrade to Spark 2.0. Any Thoughts/suggestions/recommendations would be highly appreciated.
One workaround that we can think of is - Save on HDFS and then copy it to S3 via s3DistCp (any thoughts on how can this be done in sane way as our Hive metadata-store points to S3?)
Also looks like NetFlix has fixed this -[3], any idea on when they're planning to open source it?
Thanks.
[2] - https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
[3] - https://www.youtube.com/watch?v=85sew9OFaYc&feature=youtu.be&t=8m39s http://www.slideshare.net/AmazonWebServices/bdt303-running-spark-and-presto-on-the-netflix-big-data-platform
You can use:
sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
since you are on EMR just use s3 (no need for s3a)
We are using Spark 2.0 and writing Parquet to S3 pretty fast (about as fast as HDFS)
if you want to read more check out this jira ticket SPARK-10063
I think the S3 committer from Netflix is already open sourced at: https://github.com/rdblue/s3committer.