Filtering out nested array entries in a DataFrame

2019-09-19 23:20发布

问题:

I read in a DataFrame with a huge file holding on each line of it a JSON object as follows:

{
  "userId": "12345",
  "vars": {
    "test_group": "group1",
    "brand": "xband"
  },
  "modules": [
    {
      "id": "New"
    },
    {
      "id": "Default"
    },
    {
      "id": "BestValue"
    },
    {
      "id": "Rating"
    },
    {
      "id": "DeliveryMin"
    },
    {
      "id": "Distance"
    }
  ]
}

I would like to pass in to a method a list of module id-s and clear out all items, which don't make part of that list of module id-s. It should remove all other modules, which's id is not equal to any of the values from the passed in list.

Would you have a solution for it?

回答1:

As you know from Deleting nested array entries in a DataFrame (JSON) on a condition the way of reading your json file and manipulation of modules column which has schema of

root
 |-- modules: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)

which says that modules is a collection of struct[String]. For the current requirement you will have to convert the Array[struct[String]] to Array[String]

val finaldf = df.withColumn("modules", explode($"modules.id"))
                  .groupBy("userId", "vars").agg(collect_list("modules").as("modules"))

Next step would be define a udf function as

def contains = udf((list: mutable.WrappedArray[String]) => {
  val validModules = ??? //your array definition here for example : Array("Default", "BestValue")
  list.filter(validModules.contains(_))
})

And just call the udf function as

finaldf.withColumn("modules", contains($"modules")).show(false)

That should be it. I hope the answer is helpful.