I'm having difficulty on splitting a text data file with delimiter '|' into data frame columns. My loaded data file looks like this:
results1.show()
+--------------------+
| all|
+--------------------+
|DEPT_NO|ART_GRP_N...|
|29|102|354814|SKO...|
|29|102|342677|SKO...|
|29|102|334634|DUR...|
|29|102|319337|SKO...|
|29|102|316731|DUR...|
|29|102|316728|DUR...|
|29|102|316702|DUR...|
|29|102|316702|DUR...|
|29|102|276728|I-P...|
I have tried the following 2 approaches found on previous posts:
results1.select(expr("(split(all, '|'))[1]").cast("integer").as("DEPT_NO"),expr("(split(all, '|'))[4]").cast("integer").as("ART_GRP_NO"), expr("(split(all, '|'))[8]").cast("string").as("ART_NO")).show
+-------+----------+------+
|DEPT_NO|ART_GRP_NO|ART_NO|
+-------+----------+------+
| null| null| ||
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 2|
and
val dataframe10= sc.textFile(("D:/data/dnr10.txt")
.toString())
.map(_.split("|"))
.map(c => {(c(1), c(2),c(3),c(4))})
.toDF()
.show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| D| E| P| T|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 7| || 4|
It looks like the delimiter is not recognized because the splitting is done after each character and not after every '|'. Can somebody give me a hint please on how to get a correct split process in this case?