This question already has an answer here:
-
How to pivot Spark DataFrame?
6 answers
I has a text file. Now, I want expand vertical data to be horizontal as Exp1
& Exp2
. What should I do?
This is my input:
0000000 aa______ 50 F 91
0000000 aa______ 50 F 59
0000000 aa______ 50 F 20
0000000 aa______ 50 F 76
0000001 bb______ 50 F 46
0000001 bb______ 50 F 39
0000001 bb______ 50 F 8
0000001 bb______ 50 F 5
0000003 cc______ 26 F 30
0000003 cc______ 26 F 50
0000003 cc______ 26 F 71
0000003 cc______ 26 F 36
0000004 dd______ 40 M 58
0000004 dd______ 40 M 71
0000004 dd______ 40 M 20
0000004 dd______ 40 M 10
Exp1: I want to have the desired output result as below. Here the key used are the first four words in the data. The rest of the values are put on a single line.
00000000 aa______ 50 F 91 59 20 76
0000001 bb______ 50 F 46 39 8 5
0000003 cc______ 26 F 30 50 71 36
0000004 dd______ 40 M 58 71 20 10
Exp2: Same as above, however, here the values are divided into two separate lines (with the same key words):
0000000 aa______ 50 F 91 59
0000000 aa______ 50 F 20 76
0000001 bb______ 50 F 46 39
0000001 bb______ 50 F 8 5
0000003 cc______ 26 F 30 50
0000003 cc______ 26 F 71 36
0000004 dd______ 40 M 58 71
0000004 dd______ 40 M 20 10
This would probably be easier to solve if the dataframe had the data separated into columns. However, what you can do in this case is to split the string into two parts; one key and one with the other values. This can easily be done in an UDF
(it's possible with Spark functions as well but it won't be as clear):
// put the case class outside main method
case class SplitReturn(key: String, vals: String)
val splitKeyVal = udf((str: String) => {
val key = str.split(" ").init.mkString(" ")
val vals = str.split(" ").last
SplitReturn(key, vals)
})
def groupVals(n: Int) = udf((vals: Seq[String]) => {
vals.grouped(n).map(_.mkString(" ")).toSeq
})
Here the parameter nGroup
determines how many words to use as the key. Example of usage with nGroup=4
:
val nGroup = 4
val df3 = df.withColumn("ret", splitKeyVal($"value"))
.withColumn("key", $"ret.key")
.withColumn("vals", $"ret.vals")
.groupBy("key").agg(collect_list($"vals").as("val"))
.withColumn("val", explode(groupVals(nGroup)($"val")))
.select(concat($"key", lit(" "), concat_ws(" ", $"val")).as("col"))
The last row will first concatenate the values and then add the key to obtain a single column (here called col
). Result:
+--------------------------------+
|col |
+--------------------------------+
|0000004 dd______40 M 58 71 20 10|
|0000000 aa______50 F 91 59 20 76|
|0000003 cc______26 F 30 50 71 36|
|0000001 bb______50 F 46 39 8 5 |
+--------------------------------+
Setting nGroup=2
gives:
+---------------------------+
|col |
+---------------------------+
|0000001 bb______ 50 F 46 39|
|0000001 bb______ 50 F 8 5 |
|0000003 cc______ 26 F 30 50|
|0000003 cc______ 26 F 71 36|
|0000000 aa______ 50 F 91 59|
|0000000 aa______ 50 F 20 76|
|0000004 dd______ 40 M 58 71|
|0000004 dd______ 40 M 20 10|
+---------------------------+