Spark 2.0 deprecates 'DirectParquetOutputCommi

Recently we migrated from "EMR on HDFS" --> "EMR on S3" (EMRFS with consistent view enabled) and we realized the Spark 'SaveAsTable' (parquet format) writes to S3 were ~4x slower as compared to HDFS but we found a workaround of using the DirectParquetOutputCommitter -[1] w/ Spark 1.6.

Reason for S3 slowness - We had to pay the so called Parquet tax-[2] where the default output committer writes to a temporary table and renames it later where the rename operation in S3 is very expensive

Also we do understand the risk of using 'DirectParquetOutputCommitter' which is possibility of data corruption w/ speculative tasks enabled.

Now w/ Spark 2.0 this class has been deprecated and we're wondering what options do we have on the table so that we don't get to bear the ~4x slower writes when we upgrade to Spark 2.0. Any Thoughts/suggestions/recommendations would be highly appreciated.

One workaround that we can think of is - Save on HDFS and then copy it to S3 via s3DistCp (any thoughts on how can this be done in sane way as our Hive metadata-store points to S3?)

Also looks like NetFlix has fixed this -[3], any idea on when they're planning to open source it?

Thanks.

[1] - https://github.com/apache/spark/blob/21d5ca128bf3afd5c2d4c7fcc56240e28443474f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala

[2] - https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/

[3] - https://www.youtube.com/watch?v=85sew9OFaYc&feature=youtu.be&t=8m39s http://www.slideshare.net/AmazonWebServices/bdt303-running-spark-and-presto-on-the-netflix-big-data-platform

标签： hadoop apache-spark amazon-s3 amazon-emr parquet

2条回答

SAY GOODBYE

2楼-- · 2020-05-16 00:34

You can use: sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

since you are on EMR just use s3 (no need for s3a)

We are using Spark 2.0 and writing Parquet to S3 pretty fast (about as fast as HDFS)

if you want to read more check out this jira ticket SPARK-10063

0人赞添加讨论(0) 举报

Rolldiameter

3楼-- · 2020-05-16 00:41

I think the S3 committer from Netflix is already open sourced at: https://github.com/rdblue/s3committer.

0人赞添加讨论(0) 举报

Spark 2.0 deprecates 'DirectParquetOutputCommi

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间