Question:

I've read several posts on using the "like" operator to filter a Spark DataFrame for rows that contain a given string or expression, but I was wondering whether building the condition with %s string formatting, as follows, is best practice:

input_path = <s3_location_str>
my_keyword = "Arizona.*hot"  # a regular expression
dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx

# is the following correct?
substr = "'%%%s%%'" %my_keyword  # escape % via %% to get "%"
dk = dx.filter("keyword like %s" %substr)

# dk should contain rows with keyword values such as "Arizona is hot."

Note

I'm trying to get all rows in dx that contain the expression my_keyword. Otherwise, for exact matches we wouldn't need surrounding percent signs '%'.

Answer 1:

From neeraj's hint, it seems like the correct way to do this in pyspark is:

expr = "Arizona.*hot"
dk = dx.filter(dx["keyword"].rlike(expr))

Note that dx.filter($"keyword" ...) did not work for me, because the $"..." column shorthand is Scala syntax and is not available in PySpark.
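To illustrate why rlike succeeds where the like attempt from the question does not, here is a minimal sketch in plain Python (no Spark needed). It rests on two assumptions: like_to_regex, like_match and rlike_match are hypothetical helpers written for this illustration, not Spark APIs, and Python's re module only approximates the Java regular expressions that Spark uses. The sketch also ignores LIKE escape characters. The idea is that LIKE treats only % and _ as wildcards and must match the whole value, so ".*" is taken literally, while rlike matches the regex anywhere in the value.

```python
import re


def like_to_regex(like_pattern):
    """Translate a SQL LIKE pattern into a regex (hypothetical helper)."""
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")  # % matches any run of characters
        elif ch == "_":
            parts.append(".")  # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def like_match(value, like_pattern):
    # LIKE has to match the entire value, hence fullmatch.
    regex = like_to_regex(like_pattern)
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def rlike_match(value, regex):
    # rlike succeeds if the regex matches anywhere in the value.
    return re.search(regex, value) is not None


my_keyword = "Arizona.*hot"
like_pattern = "%%%s%%" % my_keyword  # same %% escaping as in the question

values = ["Arizona is hot.", "Arizona.*hot", "Alaska is cold."]
like_hits = [v for v in values if like_match(v, like_pattern)]
rlike_hits = [v for v in values if rlike_match(v, my_keyword)]

print(like_pattern)  # %Arizona.*hot%
print(like_hits)     # ['Arizona.*hot']
print(rlike_hits)    # ['Arizona is hot.', 'Arizona.*hot']
```

With like, only a value containing the literal text "Arizona.*hot" is kept, so "Arizona is hot." is dropped. With rlike, the pattern is unanchored, so no surrounding wildcards are needed.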

Answer 2:

Try the rlike function, as shown below.

df.filter(<column_name> rlike "<regex_pattern>")

For example, in Scala:

dk = dx.filter($"keyword" rlike "<pattern>")

