可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I am trying to store Stream Data into HDFS using SparkStreaming,but it Keep creating in new file insted of appending into one single file or few multiple files

If it keep creating n numbers of files,i feel it won't be much efficient

HDFS FILE SYSYTEM

Code

lines.foreachRDD(f => {
  if (!f.isEmpty()) {
    val df = f.toDF().coalesce(1)
    df.write.mode(SaveMode.Append).json("hdfs://localhost:9000/MT9")
  }
 })

In my pom I am using respective dependencies:

spark-core_2.11
spark-sql_2.11
spark-streaming_2.11
spark-streaming-kafka-0-10_2.11

回答1:

As you already realized Append in Spark means write-to-existing-directory not append-to-file.

This is intentional and desired behavior (think what would happen if process failed in the middle of "appending" even if format and file system allow that).

Operations like merging files should be applied by a separate process, if necessary at all, which ensures correctness and fault tolerance. Unfortunately this requires a full copy which, for obvious reasons is not desired on batch-to-batch basis.

回答2:

It’s creating file for each rdd as every time you are reinitialising the DataFrame variable. I would suggest have a DataFrame variable and assign as null outside of loop and inside each rdd union with the local DataFrame. After the loop write using the outer DataFrame.