Filter a PySpark DataFrame if a column contains any string from a list

Asked 2020-08-01 08:29

Suppose we have a PySpark DataFrame in which one of the columns (column_a) contains string values, and we also have a list of strings (list_a).

Dataframe:

column_a      | count
some_string   |  10
another_one   |  20
third_string  |  30

list_a:

['string', 'third', ...]

I want to filter this DataFrame and keep only the rows where column_a's value contains at least one of list_a's items.

This is the code that works to filter column_a on a single string:

df.filter(df['column_a'].like('%string_value%'))

But how can we get the same result for a list of strings, i.e. keep the rows where column_a's value contains 'string', 'third', ...?

1 Answer
chillily · answered 2020-08-01 08:56

IIUC, you want to return the rows in which column_a is "like" (in the SQL sense) any of the values in list_a.
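
For reference, here is a minimal sketch to reproduce the question's setup (the SparkSession variable name spark and the use of getOrCreate are my assumptions, not part of the original question):

from pyspark.sql import SparkSession

# Assumed entry point; any active SparkSession works
spark = SparkSession.builder.getOrCreate()

# Recreate the sample DataFrame from the question
df = spark.createDataFrame(
    [('some_string', 10), ('another_one', 20), ('third_string', 30)],
    ['column_a', 'count'],
)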

One way is to use functools.reduce:

from functools import reduce

list_a = ['string', 'third']

df1 = df.where(
    # Build one LIKE condition per pattern and OR them all together
    reduce(lambda a, b: a | b, (df['column_a'].like('%' + pat + '%') for pat in list_a))
)
df1.show()
#+------------+-----+
#|    column_a|count|
#+------------+-----+
#| some_string|   10|
#|third_string|   30|
#+------------+-----+

Essentially, you build one like condition for each string in list_a and combine them with logical OR, so a row is kept if any pattern matches. Here is the execution plan:

df1.explain()
#== Physical Plan ==
#*(1) Filter (Contains(column_a#0, string) || Contains(column_a#0, third))
#+- Scan ExistingRDD[column_a#0,count#1]
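
As the plan shows, Spark optimizes like('%...%') into a Contains predicate. An equivalent sketch (my variation, not part of the original answer) builds the same filter directly with pyspark.sql.Column.contains; df1b is an illustrative name:

from functools import reduce
from operator import or_

# Same predicate the optimizer produces: Contains(column_a, pat) OR ...
df1b = df.where(reduce(or_, (df['column_a'].contains(pat) for pat in list_a)))
df1b.show()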

Another option is to use pyspark.sql.Column.rlike instead of like.

df2 = df.where(
    # Join the patterns into a single regex: (string)|(third)
    df['column_a'].rlike("|".join(["(" + pat + ")" for pat in list_a]))
)

df2.show()
#+------------+-----+
#|    column_a|count|
#+------------+-----+
#| some_string|   10|
#|third_string|   30|
#+------------+-----+

This has the corresponding execution plan:

df2.explain()
#== Physical Plan ==
#*(1) Filter (isnotnull(column_a#0) && column_a#0 RLIKE (string)|(third))
#+- Scan ExistingRDD[column_a#0,count#1]
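
One caveat with the rlike approach: the argument is interpreted as a (Java) regular expression, so patterns containing metacharacters such as . or + would not match literally. If the items in list_a should be treated as plain substrings, one sketch (an addition of mine, not from the original answer) is to escape them first; Python's re.escape produces backslash-escaped output that is generally also valid in Java regex:

import re

# Escape each pattern so regex metacharacters match literally;
# df3 is an illustrative name
df3 = df.where(df['column_a'].rlike("|".join("(" + re.escape(pat) + ")" for pat in list_a)))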