Trying to read and write parquet files from S3 with local Spark

Published 2020-06-05 01:32

Question:

I'm trying to read and write Parquet files between my local machine and S3 using Spark, but I can't seem to configure my Spark session properly to do so. Obviously there are configurations to be made, but I could not find a clear reference on how to set them.

Currently my Spark session reads local Parquet mocks and is defined as follows:

val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()

Answer 1:

I'm going to have to correct the post by himanshuIIITian slightly (sorry).

  1. Use the s3a connector, not the older, obsolete, unmaintained s3n. S3A is faster, works with the newer S3 regions (Seoul, Frankfurt, London, ...), and scales better. S3N has fundamental performance issues which have only been "fixed" in the latest Hadoop versions by deleting that connector entirely. Move on.

  2. You cannot safely use S3 as the direct destination of a Spark query, not with the classic "FileSystem" committers available today. Write to your local file:// filesystem and then copy the data up afterwards using the AWS CLI. You'll get better performance as well as the guarantees of reliable writing which you would normally expect from IO. A rough sketch of both points follows below.
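
For concreteness, here is a rough sketch of both points, assuming the hadoop-aws and AWS SDK jars (see the next answer) are already on the classpath; the bucket name, credentials, and local path are placeholders, not from the original answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()

// Point 1: configure the s3a connector instead of s3n
// (prefer environment variables or IAM roles over hard-coded keys in practice).
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "s3AccessKey")
hadoopConf.set("fs.s3a.secret.key", "s3SecretKey")

// Reading through s3a works directly:
val input = spark.read.parquet("s3a://some-bucket/abc.parquet")

// Point 2: write query output to the local filesystem first...
input.write.parquet("file:///tmp/output/xyz.parquet")

// ...then copy the resulting directory up with the AWS CLI, for example:
//   aws s3 cp /tmp/output/xyz.parquet s3://some-bucket/xyz.parquet --recursive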



Answer 2:

To read and write Parquet files from S3 with local Spark, you need to add the following two dependencies to your sbt project:

"com.amazonaws" % "aws-java-sdk" % "1.7.4"
"org.apache.hadoop" % "hadoop-aws" % "2.7.3"

I am assuming it's an sbt project. If it's a Maven project, then add the following dependencies:

<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.7.4</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>2.7.3</version>
</dependency>

Then you need to set the S3 credentials on the sparkSession, like this:

val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "s3AccessKey")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "s3SecretKey")

And it's done. Now you can read and write Parquet files to S3. For example:

val df = sparkSession.read.parquet("s3n://bucket/abc.parquet")    // Read
df.write.parquet("s3n://bucket/xyz.parquet")                      // Write
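
Putting it all together, an end-to-end round trip with the s3n setup from this answer might look roughly like this (bucket name and credentials are placeholders):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()

// s3n configuration from above (placeholder credentials).
val conf = sparkSession.sparkContext.hadoopConfiguration
conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
conf.set("fs.s3n.awsAccessKeyId", "s3AccessKey")
conf.set("fs.s3n.awsSecretAccessKey", "s3SecretKey")

// Build a small DataFrame, write it to S3, and read it back.
import sparkSession.implicits._
val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

df.write.mode("overwrite").parquet("s3n://bucket/xyz.parquet")
sparkSession.read.parquet("s3n://bucket/xyz.parquet").show()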

I hope it helps!