I am working on a dataframe df, for instance the following dataframe:
df.show()
Output:
+----+------+
|keys|values|
+----+------+
|  aa| apple|
|  bb|orange|
|  bb|  desk|
|  bb|orange|
|  bb|  desk|
|  aa|   pen|
|  bb|pencil|
|  aa| chair|
+----+------+
I use collect_set to aggregate and get a set of objects with duplicate elements eliminated (or collect_list to get a list of objects, duplicates included).
from pyspark.sql.functions import collect_set
df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))
The resulting dataframe is then as follows:
df_new.show(truncate=False)
Output:
+----+----------------------+
|keys|collectedSet_values   |
+----+----------------------+
|bb  |[orange, pencil, desk]|
|aa  |[apple, pen, chair]   |
+----+----------------------+
I am struggling to find a way to check whether a specific keyword (like 'chair') is in the resulting set of objects (in column collectedSet_values). I do not want to use a udf solution.
Please share your solutions or ideas.
Kind Regards.