How to join 2 dataframes in Spark based on a wildcard

Posted 2020-04-27 02:47

Question:

I have 2 dataframes df1 and df2. Suppose there is a location column in df1 which may contain a regular URL or a URL with a wildcard, e.g.:

  • stackoverflow.com/questions/*
  • *.cnn.com
  • cnn.com/*/politics

The second dataframe df2 has a url field which contains only valid URLs without wildcards.

I need to join these two dataframes, something like df1.join(df2, $"location" matches $"url"), if there were a magic matches operator for join conditions.

After some googling I still don't see a way to achieve this. How would you approach solving such a problem?

Answer 1:

There is a "magic" matches operator - it is called rlike.

import org.apache.spark.sql.functions.expr  // import spark.implicits._ is also needed for toDF outside spark-shell

// the wildcard patterns rewritten as proper regular expressions
val df1 = Seq("stackoverflow.com/questions/.*$", "^.*\\.cnn\\.com$", "^cnn\\.com/.*/politics$").toDF("location")
val df2 = Seq("stackoverflow.com/questions/47272330").toDF("url")

df2.join(df1, expr("url rlike location")).show
+--------------------+--------------------+
|                 url|            location|
+--------------------+--------------------+
|stackoverflow.com...|stackoverflow.com...|
+--------------------+--------------------+
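For reference, the same join can also be expressed in plain Spark SQL; a minimal sketch, assuming the active SparkSession is available as spark (as it is in spark-shell):

df1.createOrReplaceTempView("patterns")
df2.createOrReplaceTempView("urls")

spark.sql("SELECT u.url, p.location FROM urls u JOIN patterns p ON u.url RLIKE p.location").show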

However, there are some caveats:

  • Patterns have to be proper regular expressions, anchored in the case of leading / trailing wildcards (a conversion sketch follows after the physical plan below).
  • It is executed with a Cartesian product / broadcast nested loop join (see: How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?):

    == Physical Plan ==
    BroadcastNestedLoopJoin BuildRight, Inner, url#217 RLIKE location#211
    :- *Project [value#215 AS url#217]
    :  +- *Filter isnotnull(value#215)
    :     +- LocalTableScan [value#215]
    +- BroadcastExchange IdentityBroadcastMode
       +- *Project [value#209 AS location#211]
          +- *Filter isnotnull(value#209)
             +- LocalTableScan [value#209]
    

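If what is actually stored in df1 are the raw wildcard strings from the question, they can be converted to anchored regular expressions programmatically before the join. A minimal sketch, assuming * is the only wildcard character; the wildcardToRegex helper is an illustrative name, not part of the original answer:

import java.util.regex.Pattern

// convert a wildcard pattern such as "*.cnn.com" into an anchored regex by
// quoting the literal pieces and replacing "*" with ".*"
def wildcardToRegex(wildcard: String): String =
  "^" + wildcard.split("\\*", -1).map(Pattern.quote).mkString(".*") + "$"

val df1 = Seq("stackoverflow.com/questions/*", "*.cnn.com", "cnn.com/*/politics")
  .map(wildcardToRegex)
  .toDF("location")

Pattern.quote wraps each literal fragment in \Q...\E, so characters like "." are not treated as regex metacharacters, and the leading "^" / trailing "$" take care of the anchoring caveat above.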
It is possible to pre-filter candidates using the method I described in Efficient string matching in Apache Spark.
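A simpler alternative for cutting down the candidate pairs (not the LSH-based method from that answer, just a blocking-key illustration): derive a literal key, for example the registered domain, from both the pattern and the URL, equi-join on it first, and only then evaluate rlike. A rough sketch, assuming the registered domain is never covered by a wildcard (true for the examples above); blockingKey and keyUdf are illustrative names:

import org.apache.spark.sql.functions.{expr, udf}

// blocking key: the last two dot-separated labels of the host part,
// skipping labels that contain a wildcard
def blockingKey(s: String): String =
  s.takeWhile(_ != '/').split('.')
    .filter(l => l.nonEmpty && !l.contains("*"))
    .takeRight(2).mkString(".")

val keyUdf = udf(blockingKey _)

// raw wildcard patterns paired with their regex form (wildcardToRegex from the sketch above)
val patterns = Seq("stackoverflow.com/questions/*", "*.cnn.com", "cnn.com/*/politics")
  .map(w => (blockingKey(w), wildcardToRegex(w)))
  .toDF("key", "location")

val urls = df2.withColumn("key", keyUdf($"url"))

// equi-join on the key first, so the regex is only evaluated on a small candidate set
urls.join(patterns, Seq("key")).filter(expr("url rlike location")).show

Because the join itself is now an equi-join on key, Spark can use a regular hash join instead of the broadcast nested loop / Cartesian plan shown above.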