I'm trying to read and write Parquet files from my local machine to S3 using Spark, but I can't seem to configure my Spark session properly to do so. Obviously some configuration is needed, but I could not find a clear reference on how to do it.
Currently my Spark session reads local Parquet mocks and is defined as follows:
val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()
I'm going to have to correct the post by himanshuIIITian slightly (sorry).
Use the s3a connector, not the older, obsolete, unmaintained s3n. S3A is faster, works with the newer S3 regions (Seoul, Frankfurt, London, ...), and scales better. S3N has fundamental performance issues, which have only been fixed in the latest version of Hadoop by deleting that connector entirely. Move on.
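As a rough sketch of what switching to S3A looks like for the session in the question: the `fs.s3a.*` options are passed through Spark's `spark.hadoop.` prefix, and paths use the `s3a://` scheme. The environment variable names, the endpoint, the bucket and the object path are placeholders I've assumed, and this presumes a `hadoop-aws` JAR matching your Hadoop version is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object S3AReadSketch {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .master("local")
      .appName("spark session example")
      // Assumption: credentials live in these environment variables.
      // sys.env(...) throws if a variable is missing.
      .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
      // Assumption: a Frankfurt bucket; newer regions generally want an explicit endpoint.
      .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
      .getOrCreate()

    // Hypothetical bucket and path; note the s3a:// scheme rather than s3n://.
    val df = sparkSession.read.parquet("s3a://my-bucket/path/to/data.parquet")
    df.printSchema()

    sparkSession.stop()
  }
}
```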
You cannot safely use S3 as a direct destination of a Spark query, not with the classic "FileSystem" committers available today. Write to your local file:// and then copy the data up afterwards using the AWS CLI. You'll get better performance as well as the guarantees of reliable writing which you would normally expect from IO.
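The write-locally-then-upload approach could look roughly like the following. This is only a sketch: the staging directory, the bucket and the stand-in DataFrame are hypothetical, and it assumes the `aws` CLI is installed and already has credentials configured.

```scala
import org.apache.spark.sql.SparkSession
import scala.sys.process._

object LocalWriteThenUpload {
  // Hypothetical helper: builds the `aws s3 sync <local> <s3 uri>` invocation.
  def syncCommand(localDir: String, s3Uri: String): Seq[String] =
    Seq("aws", "s3", "sync", localDir, s3Uri)

  def main(args: Array[String]): Unit = {
    val localDir = "/tmp/spark-output/events" // hypothetical staging directory
    val s3Uri    = "s3://my-bucket/events"    // hypothetical destination

    val sparkSession = SparkSession.builder()
      .master("local")
      .appName("spark session example")
      .getOrCreate()

    // Stand-in data; replace with the real query result.
    val df = sparkSession.range(0, 100).toDF("id")

    // Commit the job on the local filesystem, where rename-based commits behave as expected.
    df.write.mode("overwrite").parquet(s"file://$localDir")
    sparkSession.stop()

    // Upload only after the Spark write has finished.
    val exitCode = syncCommand(localDir, s3Uri).!
    require(exitCode == 0, s"aws s3 sync failed with exit code $exitCode")
  }
}
```

Passing the command as a `Seq` avoids shell quoting problems with paths that contain spaces.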
To read and write Parquet files from S3 with local Spark, you need to add two dependencies to your sbt project (I am assuming it's an sbt project; if it's mvn, add the equivalent dependencies there). Then you need to set the S3 credentials on the sparkSession. Once that's done, you can read and write Parquet files on S3.
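For the dependency step, a `build.sbt` along these lines is one possibility. The artifact choice and every version number here are my assumptions, not something the thread specifies. The key constraint is that `hadoop-aws` should match the Hadoop version your Spark build uses.

```scala
// build.sbt (sketch; versions are assumptions, align them with your own Spark build)
ThisBuild / scalaVersion := "2.12.18"

val sparkVersion  = "3.5.1"
// hadoop-aws must match the Hadoop version Spark was built against
val hadoopVersion = "3.3.4"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"  % sparkVersion,
  // Provides the S3 connector classes; on Hadoop 3.x it pulls in the AWS SDK bundle transitively
  "org.apache.hadoop" %  "hadoop-aws" % hadoopVersion
)
```

With that on the classpath and credentials set, calls such as `sparkSession.read.parquet(...)` and `df.write.parquet(...)` should be able to resolve S3 paths.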
I hope it helps!