I want to explore my Redshift data from a Zeppelin notebook. A small EMR cluster with Spark is running behind it. I am loading Databricks' spark-redshift library:
%dep
z.reset()
z.load("com.databricks:spark-redshift_2.10:0.6.0")
and then running:
import org.apache.spark.sql.DataFrame

val query = "..."
val url = "..."
val port = 5439
val table = "..."
val database = "..."
val user = "..."
val password = "..."

val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", s"jdbc:redshift://$url:$port/$database?user=$user&password=$password")
  .option("query", query)
  .option("tempdir", "s3n://.../tmp/data")
  .load()

df.show
but I get the error
java.lang.ClassNotFoundException: Could not load an Amazon Redshift JDBC driver; see the README for instructions on downloading and configuring the official Amazon driver
I added the option
option("jdbcdriver", "com.amazon.redshift.jdbc41.Driver")
but that didn't help. I think I need to put Redshift's JDBC driver on the classpath somewhere, like I would by passing --driver-class-path to spark-shell, but how do I do that with Zeppelin?
You can add external jars with dependencies, like the JDBC driver, using either Zeppelin's dependency-loading mechanism (for example, adding the jar through the interpreter settings) or, in the case of Spark, the %dep dynamic dependency loader. Note that %dep paragraphs must, by convention, be in the first paragraph of the note, before the Spark interpreter has started.
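A minimal sketch of the %dep approach: since the official Amazon Redshift JDBC driver is downloaded manually rather than pulled from Maven Central, z.load can point at a local jar path. The path and driver version below are placeholders, not values from the question:

```scala
%dep
z.reset()
// load the locally downloaded Amazon Redshift JDBC driver jar (placeholder path and version)
z.load("/path/to/RedshiftJDBC41-1.1.10.1010.jar")
// re-load the spark-redshift package, since z.reset() cleared earlier loads
z.load("com.databricks:spark-redshift_2.10:0.6.0")
```

After running this paragraph (and restarting the interpreter if it was already running), the driver should be visible to the Spark interpreter, so the com.amazon.redshift.jdbc41.Driver class can be found at read time.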