I am having a problem saving text files to S3 using PySpark. I can save to S3, but the job first writes the output to a _temporary directory on S3 and then copies it to the intended location, which increases the job's run time significantly. I have attempted to compile a DirectFileOutputCommitter, which should write directly to the intended S3 URL, but I cannot get Spark to use this class.
Example:
someRDD.saveAsTextFile("s3a://somebucket/savefolder")
This creates a
s3a://somebucket/savefolder/_temporary/
directory, which is written to first; an S3 copy operation then moves the files to
s3a://somebucket/savefolder
My question is: does anyone have a working jar of the DirectFileOutputCommitter, or experience working around this issue?
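
For reference, this is roughly how I have been trying to wire the committer in. The fully qualified class name is my assumption based on Aaron Davidson's gist linked below, and the compiled jar is shipped with --jars so the executors can load it; sc._jsc is PySpark's private handle to the underlying JavaSparkContext.

# Minimal sketch of my attempt; the committer class name is assumed,
# not confirmed, and the jar must already be on the classpath.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("direct-committer-test")
sc = SparkContext(conf=conf)

# saveAsTextFile goes through the old mapred API, which reads its
# OutputCommitter from the mapred.output.committer.class property.
sc._jsc.hadoopConfiguration().set(
    "mapred.output.committer.class",
    "org.apache.spark.DirectOutputCommitter",  # assumed package/class name
)

someRDD = sc.parallelize(["a", "b", "c"])
someRDD.saveAsTextFile("s3a://somebucket/savefolder")

I have also tried passing the same property at submit time with --conf spark.hadoop.mapred.output.committer.class=... (Spark copies spark.hadoop.* properties into the Hadoop configuration), but neither approach seems to take effect.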
Relevant Links:
- https://issues.apache.org/jira/browse/HADOOP-10400
- https://gist.github.com/aarondav/c513916e72101bbe14ec
- https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3C029201d06334$a0871180$e1953480$@gmail.com%3E
- http://tech.grammarly.com/blog/posts/Petabyte-Scale-Text-Processing-with-Spark.html