After some processing I have a DStream[String, ArrayList[String]]. When I write it to HDFS using saveAsTextFile, the data gets overwritten after every batch. How can I write each new result by appending it to the previous results?
output.foreachRDD(r => {
  r.saveAsTextFile(path)
})
Edit :: I would also appreciate help with converting the output to Avro format and then writing it to HDFS in append mode.
saveAsTextFile does not support append. If called with a fixed filename, it will overwrite it every time. We could do saveAsTextFile(path + timestamp) to save to a new file every time. That's the basic functionality of DStream.saveAsTextFiles(path).
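A minimal sketch of the timestamp approach, assuming the same output DStream and path prefix as in the question:

import org.apache.spark.streaming.Time

// Write each batch to its own directory keyed by the batch time,
// so earlier results are never overwritten.
output.foreachRDD { (rdd, time: Time) =>
  rdd.saveAsTextFile(path + "-" + time.milliseconds)
}

// The built-in helper does the same thing: it writes one directory
// per batch, named "<path>-<batchTimeInMs>.txt".
output.saveAsTextFiles(path, "txt")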
An easily accessible format that supports append is Parquet. We first transform our data RDD to a DataFrame or Dataset, and then we can benefit from the write support offered on top of that abstraction. Note that appending to Parquet files gets more expensive over time, so rotating the target file from time to time is still a requirement.
If you want to keep appending to the same location on the file system, store the result as a Parquet file. You can do it roughly as sketched below.
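The original code for this answer is not shown; one way it could look, assuming a DataFrame df built from the stream's output and a hypothetical target path:

import org.apache.spark.sql.SaveMode

// Append each new batch of results to the existing Parquet data
// instead of overwriting it.
df.write
  .mode(SaveMode.Append)
  .parquet("hdfs:///user/output/results.parquet")  // hypothetical path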