How to parse a CSV file with escaped newline characters

Posted 2019-07-27 12:01

I am facing an issue when reading and parsing a CSV file. Some records contain a newline symbol "escaped" by a \, and those records are not quoted. The file might look like this:

Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;

I've tried reading it with sc.textFile("file.csv") and with sqlContext.read.format("..databricks..").option("escape/delimiter/...").load("file.csv").
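Spelled out, the two attempts look roughly like this (the format name and option values below are illustrative rather than verbatim from my code; com.databricks.spark.csv is the spark-csv package commonly used with Spark 1.6):

// Attempt 1: plain line-based read
val raw = sc.textFile("file.csv")

// Attempt 2: spark-csv reader with explicit delimiter and escape character
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ";")
  .option("escape", "\\")  // set "\" as the escape character
  .load("file.csv")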

However, no matter how I read it, a new record/line/row is created whenever "\ \n" is reached. So, instead of getting 2 records from the file above, I am getting three:

[Line1field1,Line1field2.1,null] (3 fields)
[Line1field2.2,Line1field3,null] (3 fields)
[Line2FIeld1,Line2field2,Line2field3;] (3 fields)

The expected result is:

[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)
[Line2FIeld1,Line2field2,Line2field3] (3 fields)

(How the newline symbol ends up stored in the record is not that important; the main issue is getting the correct set of records/lines.)

Any ideas on how to do that, without modifying the original file and preferably without any post-/re-processing? (For example, reading the file, filtering out any lines with fewer fields than expected, and then concatenating them could be a solution, but it is far from optimal; a rough sketch of that idea follows.)
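For completeness, a minimal sketch of that fallback, assuming the expected field count is 3 and the file is small enough to collect on the driver:

// Naive post-processing (the fallback I'd rather avoid): collect all lines,
// then glue any line with fewer than the expected fields onto the next one.
val expected = 3
val logical = sc.textFile("file.csv").collect()
  .foldLeft(Vector.empty[String]) { (acc, line) =>
    if (acc.nonEmpty && acc.last.split(";").length < expected)
      acc.init :+ (acc.last.stripSuffix("\\") + line) // merge the continuation
    else
      acc :+ line
  }
val records = logical.map(_.split(";"))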

My hope was to use Databricks' CSV parser and set the escape character to \ (which is supposed to be the default anyway), but that didn't work: I got an error saying java.io.IOException: EOF whilst processing escape sequence.

Should I somehow extend the parser, effectively creating my own? What would be the best solution?

Thanks!

EDIT: Forgot to mention, I'm using Spark 1.6.

1 Answer
Aperson
answered 2019-07-27 12:27

The wholeTextFiles API should come to the rescue in your case. It reads files as key-value pairs: the key is the path of the file and the value is the whole text of the file. You will have to do some replacing and splitting to get the desired output, though:

// wholeTextFiles gives (path, fileContents) pairs; from the contents we
// 1) drop the escaped newlines ("\" + "\n") to re-join continuation lines,
// 2) strip the ";" preceding each real newline, 3) split into logical lines,
// and finally split each logical line into its fields.
val rdd = sparkSession.sparkContext.wholeTextFiles("path to the file")
  .flatMap(x => x._2.replace("\\\n", "").replace(";\n", "\n").split("\n"))
  .map(x => x.split(";"))

The rdd output is:

[Line1field1,Line1field2.1 Line1field2.2,Line1field3]
[Line2FIeld1,Line2field2,Line2field3]
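
Since the question targets Spark 1.6, where SparkSession does not yet exist (it was introduced in Spark 2.0), the same logic can be run straight off the existing SparkContext; a minimal sketch, assuming the file is file.csv:

// Spark 1.6 variant: use sc (the SparkContext) directly
val rdd = sc.wholeTextFiles("file.csv")
  .flatMap { case (_, text) =>
    text.replace("\\\n", "").replace(";\n", "\n").split("\n")
  }
  .map(_.split(";"))

// inspect the parsed records
rdd.collect().foreach(r => println(r.mkString("[", ",", "]")))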