I read in a DataFrame with a huge file holding on each line of it a JSON object as follows:
{
"userId": "12345",
"vars": {
"test_group": "group1",
"brand": "xband"
},
"modules": [
{
"id": "New"
},
{
"id": "Default"
},
{
"id": "BestValue"
},
{
"id": "Rating"
},
{
"id": "DeliveryMin"
},
{
"id": "Distance"
}
]
}
How could I manipulate in such way the DataFrame, to keep only the module with id="Default" ? How to just delete all the other, if id does not equal "Default"?
As you said you have json
format given in question in each line as
{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}
{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}
If thats true then you can use sqlContext
's json
api to read the json
file to dataframe
as below
val df = sqlContext.read.json("path to json file")
which should give you dataframe
as
+--------------------------------------------------------------------+------+--------------+
|modules |userId|vars |
+--------------------------------------------------------------------+------+--------------+
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
+--------------------------------------------------------------------+------+--------------+
and schema
be
root
|-- modules: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
|-- userId: string (nullable = true)
|-- vars: struct (nullable = true)
| |-- brand: string (nullable = true)
| |-- test_group: string (nullable = true)
Final step would be to filter
only the modules.id
with Default
as value
val finaldf = df.withColumn("modules", explode($"modules.id"))
.filter($"modules" === "Default")
which should give you
+-------+------+--------------+
|modules|userId|vars |
+-------+------+--------------+
|Default|12345 |[xband,group1]|
|Default|12345 |[xband,group1]|
+-------+------+--------------+
I hope the answer is helpful
Updated
this would create json
as
{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
But if your requirement is to get as below
{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
You should be exploding the modules
and not modules.id
val finaldf = df.withColumn("modules", explode($"modules"))
.filter($"modules.id" === "Default")