How can I overwrite the output of an RDD at an existing path when saving?
test1:
975078|56691|2.000|20171001_926_570_1322
975078|42993|1.690|20171001_926_570_1322
975078|46462|2.000|20171001_926_570_1322
975078|87815|1.000|20171001_926_570_1322
from pyspark.sql import Row

rdd = sc.textFile('/home/administrator/work/test1').map(lambda x: x.split("|")[:4]).map(lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2])))
rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1")
The first time it saves properly. Then I removed one line from the input file and tried to save the RDD to the same location, and it fails saying the path already exists.
rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1")
For example, with a DataFrame we can overwrite an existing path:
df.coalesce(1).write.mode("overwrite").save(path)
If I try the same on the RDD object, I get an error:
rdd.coalesce(1).write().overwrite().saveAsPickleFile(path)
Please help me with this.
Hi, you can save RDD output like below. Note: the code is in Scala, but the logic is the same for Python. I am using Spark version 2.3.0.
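Since the snippet itself is not shown above, here is a minimal PySpark sketch of one common way to do this: remove the existing output directory before calling saveAsPickleFile again. It reuses the rdd and path from the question and assumes a local filesystem path; for HDFS you would delete the directory through the Hadoop FileSystem API instead of shutil.

import os
import shutil

out_path = "/home/administrator/work/foobar_seq1"

# Delete any previous output so the save does not fail with "path already exists".
if os.path.exists(out_path):
    shutil.rmtree(out_path)

rdd.coalesce(1).saveAsPickleFile(out_path)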
Or, if you are working with a DataFrame, then use the writer's overwrite mode.
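For example (a sketch only; the output path and Parquet format here are illustrative choices, not from the original answer):

# mode("overwrite") replaces whatever already exists at the target location.
df.coalesce(1).write.mode("overwrite").parquet("/home/administrator/work/foobar_df")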
Or, for more info, please look at this.
The RDD API has no write mode, but you can convert the RDD to a DataFrame and use the DataFrame's overwrite mode, as follows:
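A minimal sketch of that conversion, assuming Spark 2.x so that SparkSession is available (note the output is then written in the DataFrame writer's format, Parquet by default, rather than as a pickle file):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the same RDD of Rows as in the question.
rdd = (spark.sparkContext.textFile("/home/administrator/work/test1")
       .map(lambda x: x.split("|")[:4])
       .map(lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2]))))

# Convert to a DataFrame and overwrite the existing output path.
df = spark.createDataFrame(rdd)
df.coalesce(1).write.mode("overwrite").save("/home/administrator/work/foobar_seq1")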