How to parse a JSON file with Spark

Posted 2019-09-13 00:53

Question:

I have a JSON file to parse. The JSON format looks like this:

{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}

I need to extract every value in the file. How can I get "major" out of the array, and do I have to fetch "state" with df.select("cv_parse.basic_info.location.state")?

This is the result I want:

cv_id  major    degree    birthyear  state
001    English  Bachelor  1984       New York
001    English  Master    1984       New York

Answer 1:

This might not be the best way of doing it but you can give it a shot.

// import the SQL functions (for explode) and the implicits (for the $"..." column syntax)
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// read the JSON file (by default, Spark expects one JSON object per line)
val jsonDf = sqlContext.read.json("sample-data/sample.json")

jsonDf.printSchema

Your schema would be:

root
 |-- cv_id: string (nullable = true)
 |-- cv_parse: struct (nullable = true)
 |    |-- basic_info: struct (nullable = true)
 |    |    |-- birthyear: string (nullable = true)
 |    |    |-- location: struct (nullable = true)
 |    |    |    |-- state: string (nullable = true)
 |    |-- educations: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- degree: string (nullable = true)
 |    |    |    |-- major: string (nullable = true)
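On the nested-field part of the question: for struct fields a dotted path in select is enough, and the field to ask for is state, since that is what the sample record contains. The sketch below is only an illustration under a few assumptions: Spark 2.2 or later (SparkSession instead of sqlContext, and read.json on a Dataset[String]), an inline copy of the sample record instead of a file, and illustrative names (spark, sample, df, state, majors) that are not from the original post.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.2+. In spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("nested-json").getOrCreate()
import spark.implicits._

// Inline copy of the sample record from the question (one JSON object per string).
val sample = Seq(
  """{"cv_id":"001","cv_parse":{"educations":[{"major":"English","degree":"Bachelor"},{"major":"English","degree":"Master "}],"basic_info":{"birthyear":"1984","location":{"state":"New York"}}}}"""
)
val df = spark.read.json(sample.toDS())

// Struct fields: a dotted path reaches any depth.
val state = df.select("cv_parse.basic_info.location.state")

// Array of structs: the same dotted path yields one array per row,
// not one row per array element.
val majors = df.select("cv_parse.educations.major")
majors.printSchema()
```

Here majors still has a single row whose major column is an array of strings, which is why the educations array has to be exploded to get one row per education entry.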

Now you need to explode the educations column:

val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
  $"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")

explodedResult.printSchema

Now your schema would be:

root
 |-- cv_id: string (nullable = true)
 |-- col: struct (nullable = true)
 |    |-- degree: string (nullable = true)
 |    |-- major: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- state: string (nullable = true)

Now you can select the columns:

explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show

+-----+---------+--------+--------+-------+
|cv_id|birthyear|   state|  degree|  major|
+-----+---------+--------+--------+-------+
|  001|     1984|New York|Bachelor|English|
|  001|     1984|New York| Master |English|
+-----+---------+--------+--------+-------+
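The steps above can also be combined into one pipeline that returns the columns in the order the question asked for. This is a hedged sketch, not the answerer's code: it assumes Spark 2.2 or later, reads an inline copy of the sample record rather than sample-data/sample.json, names the exploded column education (a hypothetical alias replacing the default col), and trims the trailing space in "Master ".

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, trim}

// Assumption: Spark 2.2+. In spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("cv-flatten").getOrCreate()
import spark.implicits._

// Inline copy of the sample record from the question.
val sample = Seq(
  """{"cv_id":"001","cv_parse":{"educations":[{"major":"English","degree":"Bachelor"},{"major":"English","degree":"Master "}],"basic_info":{"birthyear":"1984","location":{"state":"New York"}}}}"""
)
val jsonDf = spark.read.json(sample.toDS())

val result = jsonDf
  // One output row per element of the educations array.
  .select(
    $"cv_id",
    explode($"cv_parse.educations").as("education"),
    $"cv_parse.basic_info.birthyear",
    $"cv_parse.basic_info.location.state")
  // Flatten the exploded struct and reorder to: cv_id, major, degree, birthyear, state.
  .select(
    $"cv_id",
    $"education.major",
    trim($"education.degree").as("degree"),
    $"birthyear",
    $"state")

result.show()
```

Giving the exploded column an explicit alias keeps the second select readable, and the trim call is optional if the stray whitespace in the source data should be preserved.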