Spark best approach Look-up Dataframe to improve p

2020-04-30 02:57发布

Dataframe A (millions of records) one of the column is create_date,modified_date

Dataframe B 500 records has start_date and end_date

Current approach:

Select a.*,b.* from a join b on a.create_date between start_date and end_date

The above job takes half hour or more to run.

how can I improve the performance

标签： scala apache-spark cassandra datastax-enterprise

2条回答

女痞

2楼-- · 2020-04-30 03:10

As others suggested, one of the approach is to broadcast the smaller dataframe. This can be done automatically also by configuring the below parameter.

spark.sql.autoBroadcastJoinThreshold

If the dataframe size is smaller than the value specified here, Spark automatically broadcasts the smaller dataframe instead of performing a join. You can read more about this here.

0人赞添加讨论(0) 举报

何必那么认真

3楼-- · 2020-04-30 03:21

DataFrames currently doesn't have an approach for direct joins like that. It will fully read both tables before performing a join.

https://issues.apache.org/jira/browse/SPARK-16614

You can use the RDD API to take advantage of the joinWithCassandraTable function

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable

0人赞添加讨论(0) 举报

Spark best approach Look-up Dataframe to improve p

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间