I have a DataFrame like this:
+--+--------+--------+----+-------------+------------------------------+
|id|name |lastname|age |timestamp |creditcards |
+--+--------+--------+----+-------------+------------------------------+
|1 |michel |blanc |35 |1496756626921|[[hr6,3569823], [ee3,1547869]]|
|2 |peter |barns |25 |1496756626551|[[ye8,4569872], [qe5,3485762]]|
+--+--------+--------+----+-------------+------------------------------+
Its schema is:
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- age: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- creditcards: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- number: string (nullable = true)
I would like to convert each row to a JSON string that follows this schema, so that the resulting DataFrame has a single string column containing the JSON. The first row should look like this:
{
  "id":"1",
  "name":"michel",
  "lastname":"blanc",
  "age":"35",
  "timestamp":"1496756626921",
  "creditcards":[
    {
      "id":"hr6",
      "number":"3569823"
    },
    {
      "id":"ee3",
      "number":"1547869"
    }
  ]
}
The second row should look like this:
{
  "id":"2",
  "name":"peter",
  "lastname":"barns",
  "age":"25",
  "timestamp":"1496756626551",
  "creditcards":[
    {
      "id":"ye8",
      "number":"4569872"
    },
    {
      "id":"qe5",
      "number":"3485762"
    }
  ]
}
My goal is not to write the DataFrame to a JSON file. I want to convert df1 into a second DataFrame, df2, so that I can push each JSON row of df2 to a Kafka topic.
I have this code to create the DataFrame:
val line1 = """{"id":"1","name":"michel","lastname":"blanc","age":"35","timestamp":"1496756626921","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3","number":"1547869"}]}"""
val line2 = """{"id":"2","name":"peter","lastname":"barns","age":"25","timestamp":"1496756626551","creditcards":[{"id":"ye8","number":"4569872"}, {"id":"qe5","number":"3485762"}]}"""
val rdd = sc.parallelize(Seq(line1, line2))
val df = sqlContext.read.json(rdd)
df.show(false)
df.printSchema()
Does anyone know how to do this?
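One direction that might work, as a minimal sketch rather than a confirmed solution: pack all columns into a single struct and serialize it with `to_json`. This assumes Spark 2.1 or later (where `to_json` is available) and continues from the `df` built above in a Spark shell. The names `ordered` and `df2` and the column name `value` are only illustrative. The explicit `select` is there because, as far as I know, `read.json` infers the schema with fields sorted alphabetically, so the keys would otherwise not come out in the order shown above.

```scala
import org.apache.spark.sql.functions.{col, struct, to_json}

// Assumption: Spark 2.1+; `df` is the DataFrame created from line1 and line2.
// Put the columns in the desired key order before serializing.
val ordered = df.select("id", "name", "lastname", "age", "timestamp", "creditcards")

// Wrap every column in one struct and turn that struct into a JSON string.
// The result has a single string column, here named "value" (illustrative).
val df2 = ordered.select(to_json(struct(ordered.columns.map(col): _*)).as("value"))

df2.show(false)
df2.printSchema()
```

On older versions, `df.toJSON` should give the same JSON strings as an RDD or Dataset of String, though I have not compared the two on nested arrays.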