I have two dataframes, `df1` and `df2`. Suppose there is a `location` column in `df1` which may contain a regular URL or a URL with a wildcard, e.g.:
- stackoverflow.com/questions/*
- *.cnn.com
- cnn.com/*/politics
The second dataframe `df2` has a `url` field which may contain only valid URLs without wildcards.
I need to join these two dataframes, something like `df1.join(df2, $"location" matches $"url")`, as if there were a magic `matches` operator usable in join conditions.
After some googling I still don't see a way to achieve this. How would you approach solving such a problem?
There does exist a "magic" `matches` operator: it is called `rlike`.
import org.apache.spark.sql.functions.expr

val df1 = Seq("stackoverflow.com/questions/.*", "^.*\\.cnn\\.com$", "cnn.com/.*/politics").toDF("location")
val df2 = Seq("stackoverflow.com/questions/47272330").toDF("url")
df2.join(df1, expr("url rlike location")).show
+--------------------+--------------------+
| url| location|
+--------------------+--------------------+
|stackoverflow.com...|stackoverflow.com...|
+--------------------+--------------------+
However, there are some caveats:
- Patterns have to be proper regular expressions, anchored in case of leading / trailing wildcards (see the conversion sketch after this list).
- It is executed with a Cartesian product (How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?):
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner, url#217 RLIKE location#211
:- *Project [value#215 AS url#217]
: +- *Filter isnotnull(value#215)
: +- LocalTableScan [value#215]
+- BroadcastExchange IdentityBroadcastMode
+- *Project [value#209 AS location#211]
+- *Filter isnotnull(value#209)
+- LocalTableScan [value#209]
- It is possible to prefilter candidates using the method I described in Efficient string matching in Apache Spark; a simplified sketch of that idea follows.
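On the first caveat: wildcard patterns such as `*.cnn.com` have to be translated into anchored regular expressions before `rlike` can use them. A minimal sketch of one way to do that, assuming `*` is the only wildcard character (the `wildcardToRegex` helper is mine, not a Spark API):

import java.util.regex.Pattern
import org.apache.spark.sql.functions.udf

// Quote each literal chunk, replace '*' with '.*', and anchor the result,
// so leading / trailing wildcards behave as intended under rlike's
// unanchored "find" semantics.
val wildcardToRegex = udf { (p: String) =>
  "^" + p.split("\\*", -1).map(Pattern.quote).mkString(".*") + "$"
}

val wildcards = Seq("stackoverflow.com/questions/*", "*.cnn.com", "cnn.com/*/politics").toDF("location")
val df1Anchored = wildcards.withColumn("location", wildcardToRegex($"location"))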
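On the last point, the linked answer generates candidate pairs with locality-sensitive hashing before the expensive comparison. A much simpler variant of the same prefiltering idea, sketched here under the assumption that every pattern contains a literal registered domain: derive an equality key on both sides, equi-join on it, and only verify `rlike` on the surviving candidates, so Spark can use a hash join instead of a nested loop over the full Cartesian product. The `domainKey` helper and its regex are illustrative only:

import org.apache.spark.sql.functions.{col, expr, regexp_extract}

// Illustrative equality key: the registered domain (e.g. "cnn.com"),
// extracted from the wildcard form of the pattern and from the plain URL.
// Handles host[/path] strings only.
def domainKey(c: String) =
  regexp_extract(col(c), "([A-Za-z0-9-]+\\.[A-Za-z]+)(/.*)?$", 1)

val patterns = Seq(
  ("stackoverflow.com/questions/*", "stackoverflow.com/questions/.*"),
  ("*.cnn.com", "^.*\\.cnn\\.com$"),
  ("cnn.com/*/politics", "cnn.com/.*/politics")
).toDF("wildcard", "location")

df2.withColumn("key", domainKey("url"))
  .join(patterns.withColumn("key", domainKey("wildcard")), Seq("key"))
  .filter(expr("url rlike location"))
  .select("url", "location")
  .show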