I have a CSV file with a single column, and the rows look like this:
123 || food || fruit
123 || food || fruit || orange
123 || food || fruit || apple
I want to create a CSV file with a single column containing the distinct values:
orange
apple
I tried the following code:
val data = sc.textFile("fruits.csv")
val rows = data.map(_.split("||"))
val rddnew = rows.flatMap( arr => {
val text = arr(0)
val words = text.split("||")
words.map( word => ( word, text ) )
} )
But this code does not give me the result I want.
Can anyone please help me with this?
You need to escape the special characters when splitting, because split takes a regular expression and | is the regex alternation operator:
.split("\\|\\|")
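To see the difference outside Spark, here is a minimal plain-Scala sketch. The object name SplitDemo and the helper fields are made up for illustration. Pattern.quote is shown as an alternative to hand-written backslashes, and trimming the pieces is an extra step that is not part of the snippet above.

```scala
import java.util.regex.Pattern

object SplitDemo {
  // Hypothetical helper: split one line on the literal "||" and trim each piece.
  // Pattern.quote builds a regex that matches the two characters literally.
  def fields(line: String): Array[String] =
    line.split(Pattern.quote("||")).map(_.trim)

  def main(args: Array[String]): Unit = {
    val line = "123 || food || fruit || orange"

    // Unescaped: "||" is a regex made of empty alternatives, so it matches the
    // empty string and the line is cut between characters, not at the delimiter.
    val unescaped = line.split("||")
    println(unescaped.length)

    // Escaped by hand, and via Pattern.quote: both split on the literal "||".
    println(line.split("\\|\\|").map(_.trim).mkString(","))
    println(fields(line).mkString(","))
  }
}
```

Either form of escaping should work; Pattern.quote is mainly convenient when the delimiter comes from a variable.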
Converting to CSV is tricky because data strings may contain your delimiter (in quotes), newlines, or other parse-sensitive characters, so I'd recommend using spark-csv:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "||")
.option("header", "true")
.option("inferSchema", "true")
.load("words.csv")
and, to write the DataFrame back out:
df.write
.format("com.databricks.spark.csv")
.option("delimiter", "||")
.option("header", "true")
.save("words.csv")
Note that spark-csv accepts only a single-character delimiter, so the "||" option may be rejected; the built-in CSV reader in Spark 3.0 and later accepts multi-character separators. Also, with header set to "true" the first line of a headerless file is consumed as column names.
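You can solve this problem with code similar to this:
// read the file as an RDD of lines
val text = sc.textFile("fruit.csv")
// split each line on the literal "||" (escaped, because split takes a regex)
val word = text.map( l => l.split("\\|\\|") )
// keep the last field of every line, trimmed of surrounding spaces
val last = word.map( w => w(w.size - 1).trim )
// drop duplicates and bring the result back to the driver
last.distinct.collect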
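Taking the last field of every line also returns fruit for the first sample row. If only the fourth field is wanted, which is an assumption based on the expected output in the question, a sketch like the following checks the field count first. LeafValues, leaf and distinctLeaves are hypothetical names, and the logic is shown on plain Scala collections so it runs without a cluster.

```scala
object LeafValues {
  // Hypothetical helper: return the trimmed fourth field, or None when the
  // line does not have exactly four fields (assumed rule, see the note above).
  def leaf(line: String): Option[String] = {
    val parts = line.split("\\|\\|").map(_.trim)
    if (parts.length == 4) Some(parts(3)) else None
  }

  // Distinct fourth-field values, in order of first appearance.
  def distinctLeaves(lines: Seq[String]): Seq[String] =
    lines.flatMap(l => leaf(l).toList).distinct

  def main(args: Array[String]): Unit = {
    // Sample rows copied from the question.
    val sample = Seq(
      "123 || food || fruit",
      "123 || food || fruit || orange",
      "123 || food || fruit || apple"
    )
    distinctLeaves(sample).foreach(println)
  }
}
```

On an RDD the same chain should carry over, for example sc.textFile("fruits.csv").flatMap(l => LeafValues.leaf(l).toList).distinct().saveAsTextFile("leaves"), where the output path "leaves" is only a placeholder.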