I have some data in a database, and I want to work with it in Spark, using sparklyr.
I can use a DBI-based package to import the data from the database into R
    dbconn <- dbConnect(<some connection args>)
    data_in_r <- dbReadTable(dbconn, "a table")
then copy the data from R to Spark using
    sconn <- spark_connect(<some connection args>)
    data_ptr <- copy_to(sconn, data_in_r)
Copying twice is slow for big datasets.
How can I copy data directly from the database into Spark?
sparklyr has several spark_read_*() functions for import, but nothing database-related. sdf_import() looks like a possibility, but it isn't clear how to use it in this context.
Sparklyr >= 0.6.0
You can use spark_read_jdbc().
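For example (a minimal sketch with illustrative connection details, assuming a PostgreSQL database and that the JDBC driver is already visible to Spark; see the driver setup below):

    library(sparklyr)

    sc <- spark_connect(master = "local")

    # Illustrative url, dbtable, driver and credentials; adjust for your database.
    db_tbl <- spark_read_jdbc(
      sc,
      name = "db_table",
      options = list(
        url      = "jdbc:postgresql://localhost:5432/mydb",
        dbtable  = "a_table",
        driver   = "org.postgresql.Driver",
        user     = "postgres",
        password = ""
      )
    )

This reads the table straight from the database into Spark and returns a tbl_spark you can use with dplyr, without materializing the data in R first.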
Sparklyr < 0.6.0
I hope there is a more elegant solution out there, but here is a minimal example using the low-level API:
Make sure that Spark has access to the required JDBC driver, for example by adding its coordinates to spark.jars.packages. For example, with PostgreSQL (adjust for the current driver version) you could add the corresponding entry to SPARK_HOME/conf/spark-defaults.conf.
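A sketch of one way to do this from R instead of editing spark-defaults.conf (the Maven coordinates and version below are illustrative; sparklyr.shell.* config entries are passed to spark-submit, so sparklyr.shell.packages becomes --packages, which is equivalent to spark.jars.packages):

    library(sparklyr)

    config <- spark_config()
    # Illustrative PostgreSQL driver coordinates; use the version matching your server.
    config[["sparklyr.shell.packages"]] <- "org.postgresql:postgresql:42.7.3"

    sc <- spark_connect(master = "local", config = config)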
Load the data and register it as a temporary view:
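A minimal sketch of this step, using sparklyr's low-level invoke() API against Spark's DataFrameReader (the connection details are illustrative, and sc is assumed to be the connection configured with the JDBC driver above):

    library(sparklyr)

    # sc: the spark_connect() connection configured with the JDBC driver above
    spark_session(sc) %>%
      invoke("read") %>%
      invoke("format", "jdbc") %>%
      # Illustrative JDBC options; adjust url, dbtable, driver and credentials.
      invoke("option", "url", "jdbc:postgresql://localhost:5432/mydb") %>%
      invoke("option", "dbtable", "a_table") %>%
      invoke("option", "driver", "org.postgresql.Driver") %>%
      invoke("option", "user", "postgres") %>%
      invoke("option", "password", "") %>%
      # Read the table and register it under a name visible to Spark SQL
      # (on Spark 1.x, use registerTempTable instead of createOrReplaceTempView)
      invoke("load") %>%
      invoke("createOrReplaceTempView", "db_table")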
You can pass multiple options at once using an environment:
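For example (a sketch reusing the illustrative connection details above; an R environment is sent to the JVM as a java.util.Map, which matches DataFrameReader.options):

    spark_session(sc) %>%
      invoke("read") %>%
      invoke("format", "jdbc") %>%
      # All options supplied in one call via an environment
      invoke("options", as.environment(list(
        url      = "jdbc:postgresql://localhost:5432/mydb",
        dbtable  = "a_table",
        driver   = "org.postgresql.Driver",
        user     = "postgres",
        password = ""
      ))) %>%
      invoke("load") %>%
      invoke("createOrReplaceTempView", "db_table")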
Load the temporary view with dplyr:
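For example, with the illustrative view name used in the sketches above:

    library(dplyr)

    # The registered temporary view behaves like any other Spark table
    db_table <- tbl(sc, "db_table")
    db_table %>% head()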
Be sure to read about further JDBC options, with a focus on partitionColumn, lowerBound / upperBound, and numPartitions.

For additional details see, for example, How to use JDBC source to write and read data in (Py)Spark? and How to improve performance for slow Spark jobs using DataFrame and JDBC connection?