Connection between RStudio Server Pro and Hive on GCP

Posted 2019-08-02 10:21

Question:

This is not a programming-related question, so please bear with me.

I have currently set up two instances on GCP: one running RStudio Server Pro and the other a cluster with a Hive database. I want to access the Hive database from RStudio Server Pro, with both running on GCP.

Can anyone guide me on this? (I have seen articles on connecting RStudio Desktop to Hive, and on running rstudio-server from within a Spark cluster, but I need to link RStudio Server Pro to a Hive database, both running on GCP.)

Answer 1:

For future reference: RStudio on Dataproc.

In this particular case, I am pushing data from the Hive database into Spark and using the sparklyr package to establish a connection from RStudio Server within the same cluster. You may also look into a Hive-R JDBC connection if you wish to connect to Hive directly.

GCP offers RStudio Server Pro on Compute Engine, but it is not cost-efficient: I used it for about 8 hours and was billed approximately $21, so at that rate, 5 days a week, you are looking at more than $100. I hope the following steps help:

RStudio Server runs on port 8787, so you will have to open this port with a firewall rule. In the GCP console, open the navigation (hamburger) menu, scroll down to VPC network, click on Firewall rules, and add a rule allowing TCP port 8787.
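If you prefer the command line, a roughly equivalent rule can be created from Cloud Shell. This is a minimal sketch: the rule name allow-rstudio is a placeholder, and you may want to narrow --source-ranges to your own IP rather than opening the port to the whole internet:

    gcloud compute firewall-rules create allow-rstudio \
        --network=default \
        --allow=tcp:8787 \
        --source-ranges=0.0.0.0/0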

Set up a Dataproc cluster based on your requirements and location. Then either SSH in through the browser window or connect through the gcloud command line; just press Enter when it prompts you to run in Cloud Shell.
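For reference, a basic cluster can also be created from the command line. A minimal sketch, where the cluster name, region, and machine types are all placeholders to adapt to your requirements:

    gcloud dataproc clusters create my-cluster \
        --region=us-central1 \
        --master-machine-type=n1-standard-4 \
        --worker-machine-type=n1-standard-4 \
        --num-workers=2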

Once you are in the window/gcloud command line, add a user for R server:

 sudo adduser rstudio 

Set a password for it. Remember it.

Next, go to the RStudio dailies page at https://dailies.rstudio.com/ and, under RStudio Server, click on the Ubuntu build. Copy the link address.

Go back to your SSH window/command line and download it. Paste the link address after sudo wget, like so:

sudo wget https://s3.amazonaws.com/rstudio-ide-build/server/trusty/amd64/rstudio-server-1.2.650-amd64.deb

Then run :

sudo apt-get install gdebi-core

Followed by the install itself (note that the version number is the one from the link above):

sudo gdebi rstudio-server-1.2.650-amd64.deb

Press yes to accept, and you should see the message rstudio-server active (running). Now navigate to the Compute Engine tab in GCP and copy the external IP of your cluster's master node (the first one). Then open a new browser tab and enter:

http://<yourexternalIPaddress>:8787 

This should open RStudio Server. Enter the user id "rstudio" and the password you set earlier. You now have RStudio Server up and running on your Dataproc cluster.
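If you ever need to re-check that the service is up (the active (running) message above comes from systemd), you can run:

    sudo systemctl status rstudio-server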

Hive:

Go back to the terminal and enter:

     beeline -u jdbc:hive2://localhost:10000/default -n <myusername>@<clustername-m> -d org.apache.hive.jdbc.HiveDriver

We shall now expose data to Hive from Google Cloud Storage, which serves as the cluster's distributed storage in place of HDFS. Note that an external table does not copy the data into Hive; it simply points Hive at the files in your bucket. Enter the command:

 CREATE EXTERNAL TABLE <giveatablename>
    (location CHAR(1),
     dept CHAR(1),
     eid INT,
     emanager VARCHAR(6))
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LOCATION 'gs://<yourgooglestoragebucket>/<foldername>/';

You now have a Hive table <giveatablename> with the columns location, dept, eid, and emanager, backed by the CSV file sitting in your Google Cloud Storage bucket. Note that LOCATION must point at the folder containing the file, not at the file itself. A quick sanity check is sketched below.
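Assuming <giveatablename> is the name you chose above, you can verify the table non-interactively from the shell with beeline's -e flag:

    beeline -u jdbc:hive2://localhost:10000/default \
        -e "SELECT * FROM <giveatablename> LIMIT 5;"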

Now exit beeline (type !quit) and run:

    sudo ln -s /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml

This symlinks Hive's configuration file into Spark's configuration directory, so that Spark can find the Hive metastore. A symlink is better than copying the file, because two copies could drift out of sync and cause confusion. You can verify the link as shown below.
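To confirm the link is in place:

    ls -l /etc/spark/conf/hive-site.xml
    # should show: ... -> /etc/hive/conf/hive-site.xml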

Spark:

Log into the spark-shell by typing:

     spark-shell 

Now enter :

    spark.catalog.listTables.show 

This checks whether the table from your Hive database is visible to Spark.

Now go back to RStudio Server in your browser and run the following commands:

  library(sparklyr)
  library(dplyr)
  # the cluster already ships Spark at /usr/lib/spark, so there is
  # no need for sparklyr::spark_install() to download a local copy
  Sys.setenv(SPARK_HOME = "/usr/lib/spark")
  config <- spark_config()
  # connect to Spark running on YARN
  sc <- spark_connect(master = "yarn-client", config = config, version = "2.2.1")

On the right-hand side you will see a new tab called "Connections", next to Environment. This is your Spark cluster connection; click on it and it should list your Hive table. You can also query the table directly from code, as sketched below.
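For example, a minimal dplyr sketch (assuming your table is named my_hive_table; substitute the name you created above):

  library(dplyr)
  # reference the Hive table through the Spark connection
  emp <- tbl(sc, "my_hive_table")
  # dplyr translates this pipeline into Spark SQL;
  # collect() brings the result back into R
  emp %>%
    group_by(dept) %>%
    summarise(n = n()) %>%
    collect()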