Suppose we have a PySpark dataframe in which one of the columns (column_a) contains some string values, and there is also a list of strings (list_a).
Dataframe:
column_a     | count
some_string  | 10
another_one  | 20
third_string | 30
list_a:
['string', 'third', ...]
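For reference, a minimal sketch that reproduces this example (the SparkSession setup is an assumption for illustration, and only the two list items shown above are used):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the example dataframe and the list of substrings from above
df = spark.createDataFrame(
    [('some_string', 10), ('another_one', 20), ('third_string', 30)],
    ['column_a', 'count']
)
list_a = ['string', 'third']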
I want to filter this dataframe and keep only the rows whose column_a value contains one of list_a's items.
This is the code that works for filtering column_a based on a single string:
df['column_a'].like('%string_value%')
But how can we get the same result for a list of strings? (That is, keep the rows whose column_a value contains 'string', 'third', ...)
IIUC, you want to return the rows in which column_a is "like" (in the SQL sense) any of the values in list_a. One way is to use functools.reduce: essentially you loop over all of the possible strings in list_a, compare each one with like, and "OR" the results together. You can then inspect the resulting execution plan with explain().
Another option is to use pyspark.sql.Column.rlike instead of like, which has a corresponding execution plan.
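A sketch of the rlike variant under the same assumptions; note that the list items are treated as regular expression patterns here, so entries containing regex metacharacters would need escaping (e.g. with re.escape):

# Build a single regex that matches any of the items, e.g. 'string|third'
filtered = df.where(df['column_a'].rlike('|'.join(list_a)))
filtered.explain()  # prints the corresponding execution plan
filtered.show()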