How can I change the HDFS replication factor for my Spark program?

Published 2020-07-23 06:32

Question:

I need to change the HDFS replication factor from 3 to 1 for my Spark program. While searching, I came across the "spark.hadoop.dfs.replication" property, but looking at https://spark.apache.org/docs/latest/configuration.html, it doesn't seem to exist anymore. So, how can I change the HDFS replication factor from my Spark program or using spark-submit?

Answer 1:

HDFS configuration is not specific to Spark in any way. You should be able to modify it using the standard Hadoop configuration files, in particular hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

It is also possible to access the Hadoop configuration through the SparkContext instance:

val hconf: org.apache.hadoop.conf.Configuration = spark.sparkContext.hadoopConfiguration
hconf.setInt("dfs.replication", 3)
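For example, here is a minimal sketch (assuming an existing SparkSession named spark, a DataFrame df to write, and a hypothetical HDFS output path) that lowers the replication factor to 1, as asked in the question, before writing:

// Grab the Hadoop Configuration backing this Spark application
val hconf: org.apache.hadoop.conf.Configuration = spark.sparkContext.hadoopConfiguration

// Lower the default replication factor for files written from now on;
// files that already exist keep the factor they were written with
hconf.setInt("dfs.replication", 1)

// Any subsequent write to HDFS picks up the new setting
df.write.parquet("hdfs:///user/example/output")  // hypothetical path

Note that this only affects writes made after the setting is applied; it does not change the replication factor of existing files.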


Answer 2:

You should use spark.hadoop.dfs.replication to set the replication factor for HDFS in your Spark application. But why can't you find it at https://spark.apache.org/docs/latest/configuration.html? Because that page only lists Spark-specific configuration. In fact, any property you set that starts with spark.hadoop.* is automatically translated into a Hadoop property by stripping the leading "spark.hadoop." prefix. You can see how this is implemented at https://github.com/apache/spark/blob/d7b1fcf8f0a267322af0592b2cb31f1c8970fb16/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala

The method you should look for is appendSparkHadoopConfigs. A sketch of how this looks in practice is shown below.
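As a rough sketch (the application name below is hypothetical; the same key can also be passed on the command line as spark-submit --conf spark.hadoop.dfs.replication=1):

import org.apache.spark.sql.SparkSession

// The "spark.hadoop." prefix is stripped, so dfs.replication=1 ends up in the
// Hadoop Configuration used by this application
val spark = SparkSession.builder()
  .appName("replication-example")               // hypothetical application name
  .config("spark.hadoop.dfs.replication", "1")
  .getOrCreate()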