Dataframe A (millions of records) one of the column is create_date,modified_date
Dataframe B 500 records has start_date and end_date
Current approach:
Select a.*,b.* from a join b on a.create_date between start_date and end_date
The above job takes half hour or more to run.
how can I improve the performance
As others suggested, one of the approach is to broadcast the smaller dataframe. This can be done automatically also by configuring the below parameter.
If the dataframe size is smaller than the value specified here, Spark automatically broadcasts the smaller dataframe instead of performing a join. You can read more about this here.
DataFrames currently doesn't have an approach for direct joins like that. It will fully read both tables before performing a join.
https://issues.apache.org/jira/browse/SPARK-16614
You can use the RDD API to take advantage of the
joinWithCassandraTable
functionhttps://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable