Config file to define JSON Schema Structure in PyS

Posted 2020-02-07 03:54

Question:

I have created a PySpark application that reads a JSON file into a DataFrame using an explicitly defined schema. Code sample below:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])
df = sqlContext.read.json(file, schema)

I need a way to define this schema in some kind of config file (an ini file, for example) and read it in the main PySpark application.

That would let me update the schema when the JSON changes in the future, without touching the main PySpark code.

Answer 1:

StructType provides the json and jsonValue methods, which return the JSON string and the dict representation of the schema respectively, and fromJson, which converts a Python dictionary back to a StructType.

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])

# Round trip: StructType -> dict -> StructType
StructType.fromJson(schema.jsonValue())

The only other thing you need is the built-in json module, to parse the config file contents into a dict that StructType.fromJson can consume.
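As a rough sketch of how that could fit together: the config file holds the dict form that jsonValue produces for the two-field schema above, and the application rebuilds the StructType from it at startup. The file name schema.json, the temporary directory, and the helper names write_schema_config, read_schema_config and load_schema are all hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile

# Hypothetical config content: the dict form that schema.jsonValue()
# produces for the two-field schema shown above.
SCHEMA_CONFIG = {
    "type": "struct",
    "fields": [
        {"name": "domain", "type": "string", "nullable": True, "metadata": {}},
        {"name": "timestamp", "type": "long", "nullable": True, "metadata": {}},
    ],
}


def write_schema_config(path, schema_dict):
    """Write the schema dict to a JSON config file."""
    with open(path, "w") as f:
        json.dump(schema_dict, f, indent=2)


def read_schema_config(path):
    """Parse the JSON config file back into a plain Python dict."""
    with open(path) as f:
        return json.load(f)


def load_schema(path):
    """Build a StructType from the JSON config file."""
    # Imported here so the config helpers above also work on a machine
    # without Spark installed (an assumption made for this sketch).
    from pyspark.sql.types import StructType
    return StructType.fromJson(read_schema_config(path))


# Hypothetical location for the config file.
config_path = os.path.join(tempfile.mkdtemp(), "schema.json")
write_schema_config(config_path, SCHEMA_CONFIG)
schema_dict = read_schema_config(config_path)
```

In the main application you would then call schema = load_schema(config_path) and pass the result to sqlContext.read.json(file, schema) as before. Changing the JSON layout then only requires editing schema.json.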

For a Scala version, see "How to create a schema from CSV file and persist/save that schema to a file?"