Converting pipe-delimited file to spark dataframe

Posted 2019-07-24 00:00

I have a CSV file with a single column, and its rows look like this:

123 || food || fruit
123 || food || fruit || orange 
123 || food || fruit || apple

I want to create a CSV file with a single column containing the distinct values:

orange
apple

I tried the following code:

 val data = sc.textFile("fruits.csv")
 val rows = data.map(_.split("||"))
 val rddnew = rows.flatMap( arr => {
   val text = arr(0)
   val words = text.split("||")
   words.map( word => ( word, text ) )
 } )

But this code does not give me the result I want.
Can anyone help me with this?

Tags: scala csv apache-spark spark-streaming rdd

2 Answers

地球回转人心会变

Reply #2 · 2019-07-24 00:53

You can solve this problem with code similar to this:

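val text = sc.textFile("fruit.csv")
// String.split takes a regex, so each pipe must be escaped
val word = text.map( l => l.split("\\|\\|") )
// keep the last field of each row, trimming the spaces around it
val last = word.map( w => w(w.size - 1).trim )
last.distinct.collect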
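
Note that taking the last field of every row also returns "fruit" for the first sample row, which the question does not want. A minimal sketch of one way to handle that, assuming the wanted values always sit in the fourth field and that shorter rows should be skipped (the names `LastField` and `fourthField` are made up for this illustration, and plain Scala collections stand in for the RDD):

```scala
object LastField {
  // Split on the literal "||" (escaped, because String.split takes a regex),
  // trim the spaces around each field, and return the fourth field
  // only when the row actually has one.
  def fourthField(line: String): Option[String] = {
    val fields = line.split("\\|\\|").map(_.trim)
    if (fields.length >= 4) Some(fields(3)) else None
  }

  def main(args: Array[String]): Unit = {
    // Sample rows copied from the question
    val lines = Seq(
      "123 || food || fruit",
      "123 || food || fruit || orange ",
      "123 || food || fruit || apple"
    )
    // Rows without a fourth field yield None and are dropped by flatMap
    val result = lines.flatMap(l => fourthField(l)).distinct
    result.foreach(println)
  }
}
```

The same function could presumably be passed to an RDD in the same way, for example `text.flatMap(l => LastField.fourthField(l)).distinct`, but that is untested here.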


再贱就再见

Reply #3 · 2019-07-24 00:55

You need to escape the special characters when splitting, since split takes a regex:

.split("\\|\\|")
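
To make the difference concrete, here is a small sketch in plain Scala (no Spark needed) that compares the unescaped split with two ways of treating "||" literally. The object name `SplitDemo` is made up for this illustration; `Pattern.quote` is the standard `java.util.regex` helper for escaping a literal string.

```scala
import java.util.regex.Pattern

object SplitDemo {
  val line = "123 || food || fruit"

  // Unescaped: the regex "||" is a set of empty alternatives, so it matches
  // the empty string and the line is cut into far more than three pieces.
  val naive: Array[String] = line.split("||")

  // Escaped by hand: each pipe is matched literally.
  val escaped: Array[String] = line.split("\\|\\|").map(_.trim)

  // Pattern.quote escapes the whole delimiter, whatever characters it contains.
  val quoted: Array[String] = line.split(Pattern.quote("||")).map(_.trim)

  def main(args: Array[String]): Unit = {
    println(s"naive pieces: ${naive.length}")
    println(escaped.mkString(","))
    println(quoted.mkString(","))
  }
}
```

`Pattern.quote` may be the more convenient choice when the delimiter comes from configuration and could contain other regex metacharacters.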

Converting to CSV is tricky because the data strings may contain your delimiter (inside quotes), newlines, or other parse-sensitive characters, so I'd recommend using spark-csv:

 val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "||")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("words.csv")

and

 df.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "||")
  .option("header", "true")
  .save("words.csv")
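
To illustrate why writing CSV by hand is error-prone, here is a minimal sketch of the usual quoting rule (wrap a field in double quotes when it contains the delimiter, a quote, or a line break, and double any embedded quotes). The names `CsvQuote` and `quoteField` are hypothetical and are not part of spark-csv; a CSV library would normally handle this for you.

```scala
object CsvQuote {
  // Quote a single field for CSV output.
  // `delimiter` defaults to a comma; this default is an assumption for the example.
  def quoteField(field: String, delimiter: String = ","): String = {
    val needsQuoting =
      field.contains(delimiter) ||
      field.contains("\"") ||
      field.contains("\n") ||
      field.contains("\r")
    // Embedded double quotes are escaped by doubling them
    if (needsQuoting) "\"" + field.replace("\"", "\"\"") + "\"" else field
  }

  def main(args: Array[String]): Unit = {
    Seq("orange", "red,apple", "say \"hi\"").foreach(f => println(quoteField(f)))
  }
}
```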

