Create Spark DataFrame schema from a JSON schema representation

Published 2019-01-23 08:37

Question:

Is there a way to serialize a dataframe schema to json and deserialize it later on?

The use case is simple: I have a JSON configuration file which contains the schemas for the dataframes I need to read. I want to be able to create that default configuration from an existing schema (taken from a dataframe), and later to generate the relevant schema by reading it back from the JSON string.
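For reference, the JSON that Spark emits for a schema is a top-level "struct" with a list of fields, each carrying name, type, nullability, and metadata. A minimal sketch of what such a configuration file would hold (the field names here are illustrative, not from the original post):

```python
import json

# Illustrative example of Spark's schema JSON format: a top-level
# "struct" node containing a list of field descriptors.
schema_json = """
{
  "type": "struct",
  "fields": [
    {"name": "id",   "type": "integer", "nullable": true, "metadata": {}},
    {"name": "name", "type": "string",  "nullable": true, "metadata": {}}
  ]
}
"""

parsed = json.loads(schema_json)
print(parsed["type"])                         # struct
print([f["name"] for f in parsed["fields"]])  # ['id', 'name']
```

This is exactly the string that `df.schema.json` produces and that the deserialization step in the answers below consumes.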

Answer 1:

There are two steps for this: creating the JSON string from an existing dataframe, and creating the schema back from the previously saved JSON string.

Creating the JSON string from an existing dataframe

val schema = df.schema
val jsonString = schema.json

Creating a schema from the JSON string

import org.apache.spark.sql.types.{DataType, StructType}
val newSchema = DataType.fromJson(jsonString).asInstanceOf[StructType]


Answer 2:

I am posting a PySpark version of the answer given by Assaf:

# Save schema from the original DataFrame into json:
schema_json = df.schema.json()

# Restore schema from json:
import json
from pyspark.sql.types import StructType
new_schema = StructType.fromJson(json.loads(schema_json))