Recently I found out about the great dplyr.spark.hive package, which enables dplyr frontend operations with a spark or hive backend.
There is information on how to install this package in the package's README:
options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")
and there are also many examples of how to work with dplyr.spark.hive when one is already connected to hiveServer2 - check this.
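For context, a typical session from those examples looks roughly like this (a sketch only; src_SparkSQL() comes from dplyr.spark.hive, and the table and column names are made up):
my_db = src_SparkSQL()
my_tbl = tbl(my_db, "some_table")   # hypothetical table in the metastore
my_tbl %>%
  group_by(some_column) %>%
  summarise(n = n())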
But I am not able to connect to hiveServer2, so I cannot benefit from the great power of this package...
I've tried the commands below, but they did not work out. Does anyone have a solution or a comment on what I am doing wrong?
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
>
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
>
> my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
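The message says a thriftserver is already running as process 37580, so starting another one fails. A way to stop the running one first (assuming the standard Spark sbin layout visible in the warning above) would be:
system("/opt/spark-1.5.0-bin-hadoop2.4/sbin/stop-thriftserver.sh")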
EDIT 2
I have set more paths in system variables, like below, but now I receive a warning telling me that some kind of Java logging configuration is not specified, though I think it is:
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
3: package ‘SparkR’ was built under R version 3.2.1
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
>
>
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
My log properties are not empty.
-bash-4.2$ wc /etc/hadoop/log4j.properties
179 432 6581 /etc/hadoop/log4j.properties
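A guess at the cause: log4j 1.x looks for log4j.properties on the Java classpath, and the only thing on the JDBC driver's classpath here is HADOOP_JAR, so /etc/hadoop is never consulted. Assuming the classpath may be a .Platform$path.sep-separated list, one could try appending the config directory:
Sys.setenv(HADOOP_JAR = paste("/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar",
                              "/etc/hadoop",  # directory containing log4j.properties
                              sep = .Platform$path.sep))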
EDIT 3
My exact call to src_SparkSQL() is:
> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
And then the process never finishes. Meanwhile, the same settings work for beeline with these params:
beeline -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output
but I am not able to pass a user name and dbname to src_SparkSQL(), so I tried to manually run the code from inside that function. I hit the same problem: the code below also never finishes:
host = 'tools-1.hadoop.srv'
port = 10000
driverclass = "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
dr = JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url = paste0("jdbc:hive2://", host, ":", port)
class = "Hive"
con.class = paste0(class, "Connection") # class = "Hive"
# dbConnect_retry =
# function(dr, url, retry){
# if(retry > 0)
# tryCatch(
# dbConnect(drv = dr, url = url),
# error =
# function(e) {
# Sys.sleep(0.1)
# dbConnect_retry(dr = dr, url = url, retry - 1)})
# else dbConnect(drv = dr, url = url)}
#################
##con = new(con.class, dbConnect_retry(dr, url, retry = 100))
#################
con = new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))
Maybe the url should also contain /loghost - the dbname?
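That would mean forming the url as below (a guess, mirroring the connection string that works for beeline):
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")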
I now see that you tried multiple things with multiple errors. Let me comment error by error.
1. The RJDBC object could not be created. Unless we solve this, nothing else will work, workarounds or not. Have you set HADOOP_JAR with, for instance,
Sys.setenv(HADOOP_JAR = "../spark/assembly/target/scala-2.10/spark-assembly-1.5.0-hadoop2.6.0.jar")
? Sorry, I seem to have skipped this in the instructions. Will fix.
2. Same problem. Please note that the host and port arguments do not accept URL syntax, just a host name and a port number; the URL is formed internally.
3. Stop the thriftserver first or connect to the existing one, but you still have to fix the class-not-found problem.
4. Same as above.
Plan:
Let me know how things go.
There was a problem: I didn't specify the proper classPath that was needed inside the JDBC function that creates the driver. Parameters for classPath in the dplyr.spark.hive package are passed via the HADOOP_JAR global variable.
To use JDBC as a driver to hiveServer2 (through the Thrift protocol) one needs to add at least 3 .jars with the Java classes required to create a proper driver. The versions are arbitrary and should be compatible with the installed versions of the local hive, hadoop and hiveServer2.
The jars need to be joined with .Platform$path.sep (as described here).
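A minimal sketch; the jar names and paths below are illustrative rather than my exact ones - pick the jars matching your local hive and hadoop versions:
classPath = c("/some/path/hive-jdbc-1.0.0-standalone.jar",
              "/some/path/commons-configuration-1.6.jar",
              "/some/path/hadoop-common-2.4.1.jar")
# join all jars with the platform class-path separator (":" on Linux)
Sys.setenv(HADOOP_JAR = paste(classPath, collapse = .Platform$path.sep))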
Then, when HADOOP_JAR is set, one has to be careful with the hiveServer2 url. In my case it had to include the database name and the auth mode, as in the sketch below.
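A sketch of the url (same host, port and database as in the working beeline call):
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")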
And finally, the proper connection to hiveServer2 using the RJDBC package looks like the sketch below.
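A sketch of the full connection, assuming HADOOP_JAR was set to the multi-jar classPath above and the user name from the beeline call:
library(RJDBC)
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
dr = JDBC("org.apache.hive.jdbc.HiveDriver", Sys.getenv("HADOOP_JAR"))
con = dbConnect(dr, url, user = "mkosinski")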