Created a nested schema in Apache Spark SQL

2019-07-30 02:14发布

问题:

I want to load a simple JSON schema in to my SparkSession which has employee with address array . The sample JSON is below

{"firstName":"Neil","lastName":"Irani", "addresses" : [ {  "city" : "Brindavan", "state" : "NJ"  }, {  "city" : "Subala", "state" : "DT"  }]}

I'm trying to create the schema for loading my JSON, I believe there is something wrong in the below way of creating schema ... please advise .. the below code is in Java ... I could not find a reasonable sample

    List<StructField> employeeFields = new ArrayList<>();
    employeeFields.add(DataTypes.createStructField("firstName", DataTypes.StringType, true));
    employeeFields.add(DataTypes.createStructField("lastName", DataTypes.StringType, true));
    employeeFields.add(DataTypes.createStructField("email", DataTypes.StringType, true));

    List<StructField> addressFields = new ArrayList<>();
    addressFields.add(DataTypes.createStructField("city", DataTypes.StringType, true));
    addressFields.add(DataTypes.createStructField("state", DataTypes.StringType, true));
    addressFields.add(DataTypes.createStructField("zip", DataTypes.StringType, true));

    employeeFields.add(DataTypes.createStructField("addresses", DataTypes.createStructType(addressFields), true));

    StructType employeeSchema = DataTypes.createStructType(employeeFields);


    Dataset<Employee>  rowDataset = sparkSession.read()
            .option("inferSchema", "false")
            .schema(employeeSchema)
            .json("simple_employees.json").as(employeeEncoder);

Update

I was not creating the Array type the below code will work fine

List<StructField> employeeFields = new ArrayList<>();
employeeFields.add(DataTypes.createStructField("firstName", DataTypes.StringType, true));
employeeFields.add(DataTypes.createStructField("lastName", DataTypes.StringType, true));
employeeFields.add(DataTypes.createStructField("email", DataTypes.StringType, true));

List<StructField> addressFields = new ArrayList<>();
addressFields.add(DataTypes.createStructField("city", DataTypes.StringType, true));
addressFields.add(DataTypes.createStructField("state", DataTypes.StringType, true));
addressFields.add(DataTypes.createStructField("zip", DataTypes.StringType, true));
ArrayType addressStruct = DataTypes.createArrayType( DataTypes.createStructType(addressFields));

employeeFields.add(DataTypes.createStructField("addresses", addressStruct, true));
StructType employeeSchema = DataTypes.createStructType(employeeFields);