I am trying to extract certain parameters from a nested JSON (with a dynamic schema) and generate a Spark dataframe using PySpark.
My code works perfectly for the level-1 (key:value) pairs, but it fails to produce an independent column for each (key:value) pair that is part of a nested JSON object.
JSON schema sample
Note - This is not the exact schema. It's just to give an idea of the nested nature of the schema.
{
  "tweet": {
    "text": "RT @author original message",
    "user": {
      "screen_name": "Retweeter"
    },
    "retweeted_status": {
      "text": "original message",
      "user": {
        "screen_name": "OriginalTweeter"
      },
      "place": {},
      "entities": {},
      "extended_entities": {}
    },
    "entities": {},
    "extended_entities": {}
  }
}
PySpark Code
from pyspark.sql.types import StructType, StructField, StringType

# explicit schema: two top-level string fields plus a nested struct for retweeted_status
schema = StructType([
    StructField("text", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("retweeted_status", StructType([
        StructField("text", StringType(), True),
        StructField("created_at", StringType(), True)
    ]), True)
])
df = spark.read.schema(schema).json("/user/sagarp/NaMo/data/NaMo2019-02-12_00H.json")
df.show()
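For reference, with the explicit schema above, printSchema() should report retweeted_status as a single struct column, roughly like this (the nullable flags are what I expect, not copied from a run):

df.printSchema()
# root
#  |-- text: string (nullable = true)
#  |-- created_at: string (nullable = true)
#  |-- retweeted_status: struct (nullable = true)
#  |    |-- text: string (nullable = true)
#  |    |-- created_at: string (nullable = true)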
Current output (with real JSON data)
All (key:value) pairs under the nested retweeted_status JSON are squashed into a single column, displayed as one list, e.g. [text, created_at, entities].
+--------------------+--------------------+--------------------+
| text| created_at| retweeted_status|
+--------------------+--------------------+--------------------+
|RT @Hoosier602: @...|Mon Feb 11 19:04:...|[@CLeroyjnr @Gabr...|
|RT @EgSophie: Oh ...|Mon Feb 11 19:04:...|[Oh cool so do yo...|
|RT @JacobAWohl: @...|Mon Feb 11 19:04:...|[@realDonaldTrump...|
+--------------------+--------------------+--------------------+
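As far as I understand, that bracketed value is just how show() renders a struct column; the nested fields are still addressable with dot notation, for example:

# pull individual fields out of the retweeted_status struct
df.select("retweeted_status.text", "retweeted_status.created_at").show()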
Expected output
I want an independent column for each key. Also note that there is already a parent-level key with the same name, text. How would you deal with such instances?
Ideally, I would want columns like "text", "entities", "retweeted_status_text", "retweeted_status_entities", etc.
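To make the expected output concrete, here is a minimal sketch of the kind of select I have in mind for the fixed schema above (the prefixed aliases are my own naming choice to avoid the clash with the top-level text). What I am really after is a way to do this when the schema is dynamic:

from pyspark.sql.functions import col

flat_df = df.select(
    col("text"),
    col("created_at"),
    # prefix the nested fields so they do not collide with the top-level "text"
    col("retweeted_status.text").alias("retweeted_status_text"),
    col("retweeted_status.created_at").alias("retweeted_status_created_at"),
)
flat_df.show()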