In my Spark application, I use the following code to retrieve data from a SQL Server database through the JDBC driver:
Dataset<Row> dfResult = sparksession.read().jdbc("jdbc:sqlserver://server;databaseName=dbname", tableName, partitionColumn, lowerBound, upperBound, numberOfPartitions, properties);
and then apply a map operation on the dfResult dataset.
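For illustration, the map step looks something like the sketch below; the transformation shown (trimming the first string column) is only a placeholder for the real logic:
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Placeholder transformation: take the first column of each row and trim it.
Dataset<String> mapped = dfResult.map(
        (MapFunction<Row, String>) row -> row.getString(0).trim(),
        Encoders.STRING());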
While running the application in standalone mode, I see that Spark creates a separate connection for each RDD partition. From the API description, I understand that Spark takes care of closing the connection.
Is there a way to reuse the connection instead of opening and closing a JDBC connection for each RDD partition?
Thanks
Even when you're pushing data into a database manually through an API, I often see recommendations to create one connection per partition:
// pseudo-code: one connection per partition, closed when the partition is done
rdd.foreachPartition { iterator =>
  val connection = SomeAPI.connect()
  iterator.foreach(record => connection.insert(record))
  connection.close()
}
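In Java against a JDBC sink, the same pattern looks roughly like the sketch below; the table name, column layout, and connection URL are made up for illustration, and properties is assumed to carry the user and password:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Row;

// One connection per partition: opened on the executor that processes the
// partition, used for every row in it, and closed when the partition is done.
dfResult.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    try (Connection conn = DriverManager.getConnection(
                 "jdbc:sqlserver://server;databaseName=dbname", properties);
         PreparedStatement stmt = conn.prepareStatement(
                 "INSERT INTO target_table (id, value) VALUES (?, ?)")) {
        while (rows.hasNext()) {
            Row row = rows.next();
            stmt.setInt(1, row.getInt(0));
            stmt.setString(2, row.getString(1));
            stmt.addBatch();
        }
        stmt.executeBatch();
    }
});
Batching the inserts keeps the per-row overhead down, but the connection itself is still opened and closed exactly once per partition.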
So if the jdbc reader is already doing that internally, that only confirms this is the intended pattern.
Here's another example of this pattern being recommended:
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs (Slide 27)
I presume the reason this is the recommended pattern is that in a multi-node cluster you never know which node a particular partition will be evaluated on, and a connection object cannot be serialized and shipped along with the task, so each partition has to open its own DB connection wherever it runs.
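To make that concrete, here is a sketch of the anti-pattern (same imports as above; table and columns again made up): a single connection created on the driver would have to be serialized into the closure and sent to the executors, which typically fails.
// Anti-pattern: a driver-side connection captured in the closure. Submitting
// this typically fails with "Task not serializable", because the connection
// object cannot be serialized and shipped to the executors.
Connection driverConn = DriverManager.getConnection(
        "jdbc:sqlserver://server;databaseName=dbname", properties);

dfResult.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    try (PreparedStatement stmt = driverConn.prepareStatement(
            "INSERT INTO target_table (id, value) VALUES (?, ?)")) {
        while (rows.hasNext()) {
            Row row = rows.next();
            stmt.setInt(1, row.getInt(0));
            stmt.setString(2, row.getString(1));
            stmt.executeUpdate();
        }
    }
});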