In my Spark application, I use the following code to retrieve data from a SQL Server database through the JDBC driver:
Dataset<Row> dfResult = sparksession.read().jdbc("jdbc:sqlserver://server;databaseName=dbname", tableName, partitionColumn, lowerBound, upperBound, numberOfPartitions, properties);
and then apply a map operation on the dfResult dataset.
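For illustration, the map step looks something like the sketch below; the transformation shown (trimming the first string column) is only a placeholder for the real logic:
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Placeholder transformation: take the first column of each row and trim it.
Dataset<String> mapped = dfResult.map(
        (MapFunction<Row, String>) row -> row.getString(0).trim(),
        Encoders.STRING());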
While running the application in standalone mode, I see that Spark creates a separate connection for each RDD partition. From the API description, I understand that Spark takes care of closing the connection.
Is there a way to reuse the connection instead of opening and closing a JDBC connection for each RDD partition?
Thanks
Even when you're pushing data into a database manually through an API, I often see recommendations to create one connection per partition:
// pseudo-code: one connection per partition, closed when the partition is done
rdd.foreachPartition { iterator =>
  val connection = SomeAPI.connect()
  iterator.foreach(record => connection.insert(record))
  connection.close()
}
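In Java against a JDBC sink, the same pattern looks roughly like the sketch below; the table name, column layout, and connection URL are made up for illustration, and properties is assumed to carry the user and password:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Row;

// One connection per partition: opened on the executor that processes the
// partition, used for every row in it, and closed when the partition is done.
dfResult.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    try (Connection conn = DriverManager.getConnection(
                 "jdbc:sqlserver://server;databaseName=dbname", properties);
         PreparedStatement stmt = conn.prepareStatement(
                 "INSERT INTO target_table (id, value) VALUES (?, ?)")) {
        while (rows.hasNext()) {
            Row row = rows.next();
            stmt.setInt(1, row.getInt(0));
            stmt.setString(2, row.getString(1));
            stmt.addBatch();
        }
        stmt.executeBatch();
    }
});
Batching the inserts keeps the per-row overhead down, but the connection itself is still opened and closed exactly once per partition.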
So if the jdbc reader is already doing that internally, that only confirms this is the intended pattern.
Here's another example of this pattern being recommended:
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs (Slide 27)
I presume the reason this is the recommended pattern is that in a multi-node cluster you never know which node a particular partition will be evaluated on, and a connection object cannot be serialized and shipped along with the task, so each partition has to open its own DB connection wherever it runs.
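To make that concrete, here is a sketch of the anti-pattern (same imports as above; table and columns again made up): a single connection created on the driver would have to be serialized into the closure and sent to the executors, which typically fails.
// Anti-pattern: a driver-side connection captured in the closure. Submitting
// this typically fails with "Task not serializable", because the connection
// object cannot be serialized and shipped to the executors.
Connection driverConn = DriverManager.getConnection(
        "jdbc:sqlserver://server;databaseName=dbname", properties);

dfResult.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    try (PreparedStatement stmt = driverConn.prepareStatement(
            "INSERT INTO target_table (id, value) VALUES (?, ?)")) {
        while (rows.hasNext()) {
            Row row = rows.next();
            stmt.setInt(1, row.getInt(0));
            stmt.setString(2, row.getString(1));
            stmt.executeUpdate();
        }
    }
});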