可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I'm trying to make my Spark Streaming application reading his input from a S3 directory but I keep getting this exception after launching it with spark-submit script:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy6.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:195)
at MainClass$.main(MainClass.scala:1190)
at MainClass.main(MainClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I'm setting those variables through this block of code as suggested here http://spark.apache.org/docs/latest/ec2-scripts.html (bottom of the page):
val ssc = new org.apache.spark.streaming.StreamingContext(
conf,
Seconds(60))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",args(2))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",args(3))
args(2) and args(3) are my AWS Access Key ID and Secrete Access Key of course.
Why it keeps saying they are not set?
EDIT: I tried also this way but I get the same exception:
val lines = ssc.textFileStream("s3n://"+ args(2) +":"+ args(3) + "@<mybucket>/path/")
回答1:
Odd. Try also doing a .set
on the sparkContext
. Try also exporting env variables before you start the application:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>
^^this is how we do it.
UPDATE: According to @tribbloid the above broke in 1.3.0, now you have to faff around for ages and ages with hdfs-site.xml, or your can do (and this works in a spark-shell):
val hadoopConf = sc.hadoopConfiguration;
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
回答2:
The following configuration works for me, make sure you also set "fs.s3.impl":
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val hadoopConf=sc.hadoopConfiguration;
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId",myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey",mySecretKey)
回答3:
On AWS EMR the above suggestions did not work. Instead I updated the following properties in the conf/core-site.xml:
fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey with your S3 credentials.
回答4:
For those using EMR, use the Spark build as described at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark and just reference S3 with the s3:// URI. No need to set S3 implementation or additional configuration as credentials are set by IAM or role.
回答5:
I wanted to put the credentials more securely in a config file on one of my encrypted partitions. So I did export HADOOP_CONF_DIR=~/Private/.aws/hadoop_conf
before running my spark application, and put a file in that directory (encrypted via ecryptfs
) called core-site.xml
containing the credentials like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>my_aws_access_key_id_here</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>my_aws_secret_access_key_here</value>
</property>
</configuration>
HADOOP_CONF_DIR
can also be set in conf/spark-env.sh
.
回答6:
Latest EMR releases (tested on 4.6.0) require the following configuration:
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
Although in most cases out of the box config should work - this is if you have different S3 credentials from the ones you launched the cluster with.
回答7:
this works for me in 1.4.1 shell:
val conf = sc.getConf
conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
conf.set("spark.hadoop.fs.s3.awsAccessKeyId", <your access key>)
conf.set("spark.hadoop.fs.s3.awsSecretAccessKey", <your secret key>)
SparkHadoopUtil.get.conf.addResource(SparkHadoopUtil.get.newConfiguration(conf))
...
sqlContext.read.parquet("s3://...")
回答8:
Augmenting @nealmcb's answer, the most straightforward way to do this is to define
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
in conf/spark-env.sh
or export that env variable in ~/.bashrc
or ~/.bash_profile
.
That will work as long as you can access s3 through hadoop. For instance, if you can run
hadoop fs -ls s3n://path/
then hadoop can see the s3 path.
If hadoop can't see the path, follow the advice of contained in How can I access S3/S3n from a local Hadoop 2.6 installation?
回答9:
In java, the following are the lines of code. You have to add AWS creds in SparkContext only and not SparkSession.
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);