Question:

Is there a way to prevent PySpark from creating several small files when writing a DataFrame to a JSON file?

If I run:

df.write.format('json').save('myfile.json')

or

df1.write.json('myfile.json')

it creates a folder named myfile.json, and inside it I find several small files named part-***, the HDFS way. Is there any way to make it write a single file instead?

Answer 1:

Well, the answer to your exact question is the coalesce function. But, as already mentioned, it is not efficient at all, because it forces a single worker to fetch all the data and write it sequentially.

df.coalesce(1).write.format('json').save('myfile.json')
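Even with coalesce(1), Spark still writes a directory that holds a single part-* file plus marker files such as _SUCCESS. If a plain file is really needed, the part file can be moved out afterwards. The sketch below is only an illustration: promote_single_part is a hypothetical helper name, and it assumes the output sits on a local filesystem (on HDFS the move would have to go through the Hadoop tooling instead).

```python
import glob
import os
import shutil


def promote_single_part(output_dir, target_path):
    """Move the lone part-* file out of a Spark output directory.

    Assumptions (not from the original answer): the directory is on a
    local filesystem, it was written with coalesce(1), and target_path
    lies outside output_dir.
    """
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.move(parts[0], target_path)
    # Remove what is left: the _SUCCESS marker and any checksum files.
    shutil.rmtree(output_dir)
    return target_path
```

After the coalesce(1) write above finishes, a call such as promote_single_part('myfile.json', 'single.json') would leave one file named single.json and delete the directory.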

P.S. By the way, the resulting file is not a valid JSON document. It is a file with one JSON object per line.
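If a consumer needs one valid JSON document rather than one object per line, the part file can be converted after the fact. This is a minimal sketch using only the standard library; json_lines_to_array is a hypothetical helper name, and it assumes the data fits in memory on one machine.

```python
import json


def json_lines_to_array(lines_path, array_path):
    """Rewrite a JSON Lines file as a single JSON array.

    Returns the number of records converted. Blank lines are skipped.
    """
    with open(lines_path, encoding="utf-8") as src:
        records = [json.loads(line) for line in src if line.strip()]
    with open(array_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst)
    return len(records)
```

The output can then be loaded with a single json.load call, which is not possible with the line-per-object file Spark produces.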

Answer 2:

This was a better solution for me.

rdd.map(json.dumps).saveAsTextFile(json_lines_file_name)
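One thing to watch with this approach is what each RDD element is. json.dumps works as expected on dicts, but a pyspark.sql.Row is a tuple subclass, so serializing it directly yields a JSON array and drops the field names. The sketch below shows the effect with a namedtuple as a stand-in for Row (an assumption made so the example runs without Spark); Record and to_json_line are hypothetical names.

```python
import json
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which is likewise a tuple subclass.
Record = namedtuple("Record", ["name", "age"])


def to_json_line(record):
    """Serialize one record as a JSON object, keeping its field names."""
    return json.dumps(record._asdict())


rec = Record(name="Alice", age=30)
naive_line = json.dumps(rec)     # tuple subclass: becomes a JSON array
object_line = to_json_line(rec)  # dict: becomes a JSON object
```

With a real DataFrame, the analogous step would presumably be mapping each row through row.asDict() before json.dumps, or using df.toJSON(), which already returns an RDD of JSON strings.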

Answer 3:

df1.rdd.repartition(1).write.json('myfile.json')

This would be nice, but it isn't available: an RDD has no write attribute. See this related question: https://stackoverflow.com/a/33311467/2843520

