pyspark; check if an element is in collect_list [d

Posted 2020-07-10 09:43

This question already has answers here:

How to filter based on array value in PySpark? (2 answers)

I am working on a dataframe df, for instance the following dataframe:

df.show()

Output:

+----+------+
|keys|values|
+----+------+
|  aa| apple|
|  bb|orange|
|  bb|  desk|
|  bb|orange|
|  bb|  desk|
|  aa|   pen|
|  bb|pencil|
|  aa| chair|
+----+------+

I use collect_set to aggregate the values per key into a set with duplicate elements eliminated (or collect_list to get a list that keeps the duplicates).

from pyspark.sql.functions import collect_set

df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))

The resulting dataframe is then as follows:

df_new.show()

Output:

+----+----------------------+
|keys|collectedSet_values   |
+----+----------------------+
|bb  |[orange, pencil, desk]|
|aa  |[apple, pen, chair]   |
+----+----------------------+

I am struggling to find a way to check whether a specific keyword (like 'chair') is in the resulting set of objects (in column collectedSet_values). I do not want to use a UDF.

Please comment your solutions/ideas.

Kind Regards.

+----+----------------------+--------------+
|keys|collectedSet_values   |contains_chair|
+----+----------------------+--------------+
|bb  |[orange, pencil, desk]|false         |
|aa  |[apple, pen, chair]   |true          |
+----+----------------------+--------------+