java.sql.SQLException: No suitable driver found wh

I'm hitting very strange problem when trying to load JDBC DataFrame into Spark SQL.

I've tried several Spark clusters - YARN, standalone cluster and pseudo distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs in both spark-shell and when executing the code with spark-submit. I've tried MySQL & MS SQL JDBC drivers without success.

Consider following sample:

val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"

val t1 = {
  sqlContext.load("jdbc", Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "t1",
    "partitionColumn" -> "id",
    "lowerBound" -> "0",
    "upperBound" -> "100",
    "numPartitions" -> "50"
  ))
}

So far so good, the schema gets resolved properly:

t1: org.apache.spark.sql.DataFrame = [id: int, name: string]

But when I evaluate DataFrame:

t1.take(1)

Following exception occurs:

15/04/29 01:56:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.42): java.sql.SQLException: No suitable driver found for jdbc:mysql://<hostname>:3306/test
    at java.sql.DriverManager.getConnection(DriverManager.java:689)
    at java.sql.DriverManager.getConnection(DriverManager.java:270)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:317)
    at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

When I try to open JDBC connection on executor:

import java.sql.DriverManager

sc.parallelize(0 until 2, 2).map { i =>
  Class.forName(driver)
  val conn = DriverManager.getConnection(url)
  conn.close()
  i
}.collect()

it works perfectly:

res1: Array[Int] = Array(0, 1)

When I run the same code on local Spark, it works perfectly too:

scala> t1.take(1)
...
res0: Array[org.apache.spark.sql.Row] = Array([1,one])

I'm using Spark pre-built with Hadoop 2.4 support.

The easiest way to reproduce the problem is to start Spark in pseudo distributed mode with start-all.sh script and run following command:

/path/to/spark-shell --master spark://<hostname>:7077 --jars /path/to/mysql-connector-java-5.1.35.jar --driver-class-path /path/to/mysql-connector-java-5.1.35.jar

Is there a way to work this around? It looks like a severe problem, so it's strange that googling doesn't help here.

回答1:

Apparently this issue has been recently reported:

https://issues.apache.org/jira/browse/SPARK-6913

The problem is in java.sql.DriverManager that doesn't see the drivers loaded by ClassLoaders other than bootstrap ClassLoader.

As a temporary workaround it's possible to add required drivers to boot classpath of executors.

UPDATE: This pull request fixes the problem: https://github.com/apache/spark/pull/5782

UPDATE 2: The fix merged to Spark 1.4

回答2:

For writing data to MySQL

In spark 1.4.0, you have to load MySQL before writing into it because it loads drivers on load function but not on write function. We have to put jar on every worker node and set the path in spark-defaults.conf file on each node. This issue has been fixed in spark 1.5.0

https://issues.apache.org/jira/browse/SPARK-10036

回答3:

We are stuck on Spark 1.3 (Cloudera 5.4) and so I found this question and Wildfire's answer helpful since it allowed me to stop banging my head against the wall.

Thought I would share how we got the driver into the boot classpath: we simply copied it into /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hive/lib on all the nodes.

回答4:

I am using spark-1.6.1 with SQL server, still faced the same issue. I had to add the library(sqljdbc-4.0.jar) to the lib in the instance and below line in conf/spark-dfault.conf file.