I have JSON data in the following format:
{
  "date": 100,
  "userId": 1,
  "data": [
    {
      "timeStamp": 101,
      "reading": 1
    },
    {
      "timeStamp": 102,
      "reading": 2
    }
  ]
}
{
  "date": 200,
  "userId": 1,
  "data": [
    {
      "timeStamp": 201,
      "reading": 3
    },
    {
      "timeStamp": 202,
      "reading": 4
    }
  ]
}
I read it into Spark SQL:
val df = sqlContext.read.json(...)
df.printSchema
// root
// |-- date: double (nullable = true)
// |-- userId: long (nullable = true)
// |-- data: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- timeStamp: double (nullable = true)
// | | |-- reading: double (nullable = true)
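For reference, here is a minimal self-contained way to build an equivalent DataFrame (assuming a Spark 1.x SparkContext named sc is already available; the inlined records below just mirror the data shown above, and the inferred types may differ slightly from the schema printed here):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// One JSON document per string, mirroring the two records shown above.
val jsonRecords = sc.parallelize(Seq(
  """{"date": 100, "userId": 1, "data": [{"timeStamp": 101, "reading": 1}, {"timeStamp": 102, "reading": 2}]}""",
  """{"date": 200, "userId": 1, "data": [{"timeStamp": 201, "reading": 3}, {"timeStamp": 202, "reading": 4}]}"""
))

// read.json also accepts an RDD[String] of JSON documents.
val df = sqlContext.read.json(jsonRecords)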
I would like to transform it so that I end up with one row per reading. As far as I understand, every transformation produces a new DataFrame, so the following should work:
import org.apache.spark.sql.functions.explode
val exploded = df
.withColumn("reading", explode(df("data.reading")))
.withColumn("timeStamp", explode(df("data.timeStamp")))
.drop("data")
exploded.printSchema
// root
// |-- date: double (nullable = true)
// |-- userId: long (nullable = true)
// |-- timeStamp: double (nullable = true)
// |-- reading: double (nullable = true)
The resulting schema is correct, but every row appears twice:
exploded.show
// +-----------+-----------+-----------+-----------+
// | date| userId| timeStamp| reading|
// +-----------+-----------+-----------+-----------+
// | 100| 1| 101| 1|
// | 100| 1| 101| 1|
// | 100| 1| 102| 2|
// | 100| 1| 102| 2|
// | 200| 1| 201| 3|
// | 200| 1| 201| 3|
// | 200| 1| 202| 4|
// | 200| 1| 202| 4|
// +-----------+-----------+-----------+-----------+
My feeling is that there is something about the lazy evaluation of the two explodes that I don't understand.
Is there a way to get the above code to work? Or should I use a different approach altogether?
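For reference, one alternative I have been considering is to explode the array of structs once and then pull the two fields out of the exploded column, roughly like the sketch below (the intermediate column name "item" is arbitrary). I would still like to understand why the double-explode version above duplicates every row, though.
import org.apache.spark.sql.functions.{col, explode}

val explodedOnce = df
  .withColumn("item", explode(df("data")))   // one row per element of the data array
  .select(
    col("date"),
    col("userId"),
    col("item.timeStamp").as("timeStamp"),
    col("item.reading").as("reading")
  )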