I need to change the HDFS replication factor from 3 to 1 for my Spark program. While searching, I came across the "spark.hadoop.dfs.replication" property, but looking at https://spark.apache.org/docs/latest/configuration.html, it doesn't seem to exist anymore. So, how can I change the HDFS replication factor from my Spark program or using spark-submit?
Answer 1:
HDFS configuration is not specific in any way to Spark. You should be able to modify it using the standard Hadoop configuration files, in particular hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
It is also possible to access the Hadoop configuration through the SparkContext instance:
val hconf: org.apache.hadoop.conf.Configuration = spark.sparkContext.hadoopConfiguration
hconf.setInt("dfs.replication", 3)
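For context, here is a minimal sketch of how that call might fit into a complete job, assuming the goal from the question of writing new output with a replication factor of 1 (the output path is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("replication-example").getOrCreate()

// dfs.replication only affects files created after this point; existing HDFS
// files keep whatever replication factor they were written with.
spark.sparkContext.hadoopConfiguration.setInt("dfs.replication", 1)

// Hypothetical output path, for illustration only.
spark.range(100).write.parquet("hdfs:///tmp/replication-example")

spark.stop()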
Answer 2:
You should use spark.hadoop.dfs.replication to set the replication factor for HDFS in your Spark application. But why can't you find it at https://spark.apache.org/docs/latest/configuration.html? Because that page lists ONLY Spark-specific configuration. In fact, any property you set that starts with "spark.hadoop." is automatically translated into a Hadoop property by stripping the leading "spark.hadoop.". You can see how this is implemented at https://github.com/apache/spark/blob/d7b1fcf8f0a267322af0592b2cb31f1c8970fb16/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala; the method to look for is appendSparkHadoopConfigs.
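For example, the property can be passed on the command line as spark-submit --conf spark.hadoop.dfs.replication=1, or set programmatically through the session builder. A minimal sketch, assuming the value 1 from the question (the application name is a placeholder):

import org.apache.spark.sql.SparkSession

// Setting the prefixed property before the session is created; Spark strips
// the "spark.hadoop." prefix and forwards "dfs.replication" to the Hadoop
// Configuration used when writing to HDFS.
val spark = SparkSession.builder()
  .appName("replication-via-conf")
  .config("spark.hadoop.dfs.replication", "1")
  .getOrCreate()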