Recently I found out about great dplyr.spark.hive
package that enables dplyr
frontend operations with spark
or hive
backend .
There is an information on how to install this package in package's README :
options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")
and there are also many examples on how to work with dplyr.spark.hive
when one is already connected to hiveServer
- check this.
But I am not able to connect to hiveServer
, so I can not benefit from the great power of this package...
I've tried such commands, but they did not work out. Does anyone have any solution or comment on what am I doing wrong?
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
>
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
>
> my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
EDIT 2
I have set more paths to system variables like this but now I receive a warning telling me that some kind of Java logging-configuration is not specified bu I think it is
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
3: package ‘SparkR’ was built under R version 3.2.1
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
>
>
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
My log properties are not empty.
-bash-4.2$ wc /etc/hadoop/log4j.properties
179 432 6581 /etc/hadoop/log4j.properties
EDIT 3
My exact call to the scr_SparkSQL()
is
> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
And then the proces does not stop (never). Where those settings work for beeline with such params:
beeline -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output
but I am not able to pass user
name and dbname
to src_SparkSQL()
so I have tried to manual use the code from inside that function but I receive the sam problem that the below code also does not finish
host = 'tools-1.hadoop.srv'
port = 10000
driverclass = "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
dr = JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url = paste0("jdbc:hive2://", host, ":", port)
class = "Hive"
con.class = paste0(class, "Connection") # class = "Hive"
# dbConnect_retry =
# function(dr, url, retry){
# if(retry > 0)
# tryCatch(
# dbConnect(drv = dr, url = url),
# error =
# function(e) {
# Sys.sleep(0.1)
# dbConnect_retry(dr = dr, url = url, retry - 1)})
# else dbConnect(drv = dr, url = url)}
#################
##con = new(con.class, dbConnect_retry(dr, url, retry = 100))
#################
con = new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))
Maybe the url
should containg also /loghost
- the dbname
?