I have the following DataFrame in Spark:
val test = sqlContext.read.json(path = "/path/to/jsonfiles/*")
test.printSchema
root
|-- properties: struct (nullable = true)
| |-- prop_1: string (nullable = true)
| |-- prop_2: string (nullable = true)
| |-- prop_3: boolean (nullable = true)
| |-- prop_4: long (nullable = true)
...
What I would like to do is flatten this DataFrame so that prop_1 ... prop_n
exist at the top level. I.e.
test.printSchema
root
|-- prop_1: string (nullable = true)
|-- prop_2: string (nullable = true)
|-- prop_3: boolean (nullable = true)
|-- prop_4: long (nullable = true)
...
There are several solutions to similar problems. The best I can find is posed here. However, that solution only works if properties is of type Array. In my case, properties is of type StructType.
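For what it's worth, a struct column can apparently be flattened with a star expansion in select. This is a sketch, not something I've verified on my data, and it assumes a Spark version where "struct.*" is supported in select:

```scala
// Sketch: expand every field of the properties struct into a top-level
// column via star expansion (assumes Spark supports "properties.*" here;
// `test` is the DataFrame loaded above).
val flat = test.select("properties.*")
flat.printSchema()
```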
An alternate approach would be something like:
test.registerTempTable("test")
val test2 = sqlContext.sql("""SELECT properties.prop_1, ... FROM test""")
But in this case I have to explicitly list each column, and that is inelegant.
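At minimum, the column list in that query could be generated from the schema instead of typed out by hand. A minimal sketch (the helper flattenExprs is my own, not a Spark API; the commented-out schema lookup assumes org.apache.spark.sql.types.StructType):

```scala
// Build "properties.prop_1 AS prop_1, ..." expressions from a list of
// field names, so the SELECT need not be written out by hand.
def flattenExprs(structCol: String, fieldNames: Seq[String]): Seq[String] =
  fieldNames.map(f => s"$structCol.$f AS $f")

// With Spark, the field names would come from the schema, e.g.:
//   val names = test.schema("properties").dataType
//     .asInstanceOf[StructType].fieldNames.toSeq
val exprs = flattenExprs("properties", Seq("prop_1", "prop_2"))
// exprs could then drive the query, e.g.:
//   val test2 = test.selectExpr(exprs: _*)
```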
What is the best way to solve this problem?