I am having a problem saving text files to S3 using PySpark. I can save to S3, but the job first writes the output to a _temporary directory on S3 and then copies it to the intended location, which increases the job's run time significantly. I have attempted to compile a DirectFileOutputCommitter, which should write directly to the intended S3 URL, but I cannot get Spark to use this class.
Example:
someRDD.saveAsTextFile("s3a://somebucket/savefolder")
This creates a
s3a://somebucket/savefolder/_temporary/
directory, which is written to first; an S3 copy operation then moves the files to
s3a://somebucket/savefolder
My question is: does anyone have a working jar of the DirectFileOutputCommitter, or experience working around this issue?
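
For reference, this is roughly how I have been trying to wire the committer in. The fully qualified class name is my assumption based on Aaron Davidson's gist linked below, and the compiled jar is shipped with --jars so the executors can load it; sc._jsc is PySpark's private handle to the underlying JavaSparkContext.

# Minimal sketch of my attempt; the committer class name is assumed,
# not confirmed, and the jar must already be on the classpath.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("direct-committer-test")
sc = SparkContext(conf=conf)

# saveAsTextFile goes through the old mapred API, which reads its
# OutputCommitter from the mapred.output.committer.class property.
sc._jsc.hadoopConfiguration().set(
    "mapred.output.committer.class",
    "org.apache.spark.DirectOutputCommitter",  # assumed package/class name
)

someRDD = sc.parallelize(["a", "b", "c"])
someRDD.saveAsTextFile("s3a://somebucket/savefolder")

I have also tried passing the same property at submit time with --conf spark.hadoop.mapred.output.committer.class=... (Spark copies spark.hadoop.* properties into the Hadoop configuration), but neither approach seems to take effect.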
Relevant Links:
- https://issues.apache.org/jira/browse/HADOOP-10400
- https://gist.github.com/aarondav/c513916e72101bbe14ec
- https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3C029201d06334$a0871180$e1953480$@gmail.com%3E
- http://tech.grammarly.com/blog/posts/Petabyte-Scale-Text-Processing-with-Spark.html