How to change the HDFS block size in PySpark?

Posted 2019-04-16 06:33

I use PySpark to write Parquet files, and I would like to change the HDFS block size of those files. I set the block size like this, but it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the PySpark job? If so, how do I do that?

2 answers
看我几分像从前
#2 · 2019-04-16 07:00

I had a similar issue and figured out the cause: the value has to be a number of bytes, not "128m". So this should work (it worked for me, at least):

block_size = str(1024 * 1024 * 128)
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
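
If the size starts out in a human-readable form such as "128m", a small helper can turn it into the byte count that worked in the snippet above. This is only a sketch: size_to_bytes is a hypothetical name, and it assumes binary units (1k = 1024 bytes), matching the 1024 * 1024 * 128 arithmetic in the answer.

```python
import re

# Binary multipliers for the single-letter suffixes (assumption: 1k = 1024 bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def size_to_bytes(size):
    """Convert a size such as '128m' or 134217728 to a byte count, returned as a string."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmgt]?)b?\s*", str(size).lower())
    if match is None:
        raise ValueError(f"unrecognised size: {size!r}")
    number, unit = match.groups()
    return str(int(number) * _UNITS[unit])


block_size = size_to_bytes("128m")
print(block_size)

# With a live SparkContext `sc`, the value would then be used as in the answer:
# sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
```

The result is kept as a string because the Hadoop configuration setter used above takes string values.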
Anthone
#3 · 2019-04-16 07:03

Try setting it through sc._jsc.hadoopConfiguration() on the SparkContext:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
# Pass the size as a byte count; the "128m" form is reported not to work in this thread
sc._jsc.hadoopConfiguration().set("dfs.block.size", str(128 * 1024 * 1024))
txt = sc.parallelize(["Hello", "world", "!"])
txt.saveAsTextFile("hdfs/output/path")  # save the output with a 128 MB block size

In Scala:

sc.hadoopConfiguration.set("dfs.block.size", (128 * 1024 * 1024).toString)
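
On the question of whether the value can be in place before the job starts: Spark copies properties prefixed with spark.hadoop. into the Hadoop configuration it builds, so the setting can also be supplied through SparkConf (or spark-submit --conf) instead of through sc._jsc. The sketch below only builds the prefixed properties; as_spark_props is a hypothetical helper name, and the commented usage assumes pyspark is installed and a YARN cluster is reachable.

```python
def as_spark_props(hadoop_settings):
    """Prefix Hadoop keys with 'spark.hadoop.' and stringify the values."""
    return {f"spark.hadoop.{key}": str(value) for key, value in hadoop_settings.items()}


# 128 MB expressed in bytes, as in the answers above.
props = as_spark_props({"dfs.block.size": 128 * 1024 * 1024})
print(props)

# Assumed usage, before the context is created (requires pyspark):
# from pyspark import SparkConf, SparkContext
# conf = SparkConf().setMaster("yarn").setAll(list(props.items()))
# sc = SparkContext(conf=conf)
```

Whether the Parquet writer honours the value in a given setup is worth checking on the written files, for example with hdfs fsck.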