Deleting nested array entries in a DataFrame (JSON

2019-08-21 06:52发布

问题:

I read in a DataFrame with a huge file holding on each line of it a JSON object as follows:

{
  "userId": "12345",
  "vars": {
    "test_group": "group1",
    "brand": "xband"
  },
  "modules": [
    {
      "id": "New"
    },
    {
      "id": "Default"
    },
    {
      "id": "BestValue"
    },
    {
      "id": "Rating"
    },
    {
      "id": "DeliveryMin"
    },
    {
      "id": "Distance"
    }
  ]
}

How could I manipulate in such way the DataFrame, to keep only the module with id="Default" ? How to just delete all the other, if id does not equal "Default"?

回答1:

As you said you have json format given in question in each line as

{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}
{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}

If thats true then you can use sqlContext's json api to read the json file to dataframe as below

val df = sqlContext.read.json("path to json file")

which should give you dataframe as

+--------------------------------------------------------------------+------+--------------+
|modules                                                             |userId|vars          |
+--------------------------------------------------------------------+------+--------------+
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
+--------------------------------------------------------------------+------+--------------+

and schema be

root
 |-- modules: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |-- userId: string (nullable = true)
 |-- vars: struct (nullable = true)
 |    |-- brand: string (nullable = true)
 |    |-- test_group: string (nullable = true)

Final step would be to filter only the modules.id with Default as value

val finaldf = df.withColumn("modules", explode($"modules.id"))
    .filter($"modules" === "Default")

which should give you

+-------+------+--------------+
|modules|userId|vars          |
+-------+------+--------------+
|Default|12345 |[xband,group1]|
|Default|12345 |[xband,group1]|
+-------+------+--------------+

I hope the answer is helpful

Updated

this would create json as

{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}

But if your requirement is to get as below

{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}

You should be exploding the modules and not modules.id

val finaldf = df.withColumn("modules", explode($"modules"))
    .filter($"modules.id" === "Default")