I read in a DataFrame with a huge file holding on each line of it a JSON object as follows:
{
"userId": "12345",
"vars": {
"test_group": "group1",
"brand": "xband"
},
"modules": [
{
"id": "New"
},
{
"id": "Default"
},
{
"id": "BestValue"
},
{
"id": "Rating"
},
{
"id": "DeliveryMin"
},
{
"id": "Distance"
}
]
}
I would like to pass in to a method a list of module id-s and clear out all items, which don't make part of that list of module id-s. It should remove all other modules, which's id is not equal to any of the values from the passed in list.
Would you have a solution for it?
As you know from Deleting nested array entries in a DataFrame (JSON) on a condition the way of reading your json
file and manipulation of modules
column which has schema
of
root
|-- modules: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
which says that modules
is a collection of struct[String]
. For the current requirement you will have to convert the Array[struct[String]]
to Array[String]
val finaldf = df.withColumn("modules", explode($"modules.id"))
.groupBy("userId", "vars").agg(collect_list("modules").as("modules"))
Next step would be define a udf
function as
def contains = udf((list: mutable.WrappedArray[String]) => {
val validModules = ??? //your array definition here for example : Array("Default", "BestValue")
list.filter(validModules.contains(_))
})
And just call the udf
function as
finaldf.withColumn("modules", contains($"modules")).show(false)
That should be it. I hope the answer is helpful.