Why were many spark-warehouse folders created?

Posted 2019-05-21 18:52

Question:

I installed Hadoop 2.8.1 on Ubuntu and then installed spark-2.2.0-bin-hadoop2.7 on top of it. I created tables using spark-shell, and then created more tables using beeline. I noticed that three different folders named spark-warehouse were created:

1- spark-2.2.0-bin-hadoop2.7/spark-warehouse

2- spark-2.2.0-bin-hadoop2.7/bin/spark-warehouse

3- spark-2.2.0-bin-hadoop2.7/sbin/spark-warehouse

What exactly is spark-warehouse, and why is it created multiple times? Sometimes my spark-shell and beeline show different databases and tables, and sometimes they show the same ones. I don't understand what is happening.

Further, I did not install Hive, but I am still able to use beeline, and I can also access the databases through a Java program. How did Hive get onto my machine? Please help me. I am new to Spark and installed it by following online tutorials.

Below is the Java code I was using to connect to Apache Spark through JDBC:

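import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// The class name is illustrative; the original snippet did not show one.
public class SparkJdbcExample {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        try {
            // Load the Hive JDBC driver.
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        // try-with-resources closes the connection and the statement on exit.
        try (Connection con = DriverManager.getConnection("jdbc:hive2://10.171.0.117:10000/default", "", "");
             Statement stmt = con.createStatement()) {
            // The rest of the original snippet was not shown.
        }
    }
}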

Answer 1:

What exactly is spark-warehouse, and why is it created multiple times?

Unless configured otherwise, Spark will create an internal Derby database named metastore_db with a derby.log. Looks like you've not changed that.

This is the default behavior, as pointed out in the documentation:

When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started

Sometimes my spark-shell and beeline show different databases and tables, and sometimes they show the same ones

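You're starting those commands from different folders, so each session only sees the metastore_db and spark-warehouse in its own current working directory. When two commands are started from the same folder, they see the same databases and tables.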
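
To illustrate why the launch directory matters, here is a minimal Java sketch of how a relative directory name resolves. It does not use Spark at all: resolveWarehouse is an illustrative helper rather than a Spark API, the /opt/spark and /data/spark-warehouse paths are hypothetical, and Unix-style paths are assumed.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class WarehousePathDemo {
    // Illustrative helper (not a Spark API): resolve a warehouse setting against
    // the directory a command was started from. A relative setting lands inside
    // that directory; an absolute setting ignores it.
    static Path resolveWarehouse(String launchDir, String warehouseSetting) {
        return Paths.get(launchDir).resolve(warehouseSetting).normalize();
    }

    public static void main(String[] args) {
        // Hypothetical launch directories mirroring the three folders in the question.
        String[] launchDirs = {"/opt/spark", "/opt/spark/bin", "/opt/spark/sbin"};
        for (String dir : launchDirs) {
            System.out.println(dir
                    + " -> relative: " + resolveWarehouse(dir, "spark-warehouse")
                    + ", absolute: " + resolveWarehouse(dir, "/data/spark-warehouse"));
        }
    }
}
```

With the relative default, each launch directory gets its own spark-warehouse, which matches the three folders in the question. An absolute value resolves to the same location regardless of where the command was started.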

I used beeline and created tables... How did Hive get onto my machine?

It didn't. You're probably connecting to either the Spark Thrift Server, which is fully compatible with the HiveServer2 protocol, or the Derby database, as mentioned. Otherwise, you actually do have a HiveServer2 instance sitting at 10.171.0.117.

Anyway, the JDBC connection is not required here. You can use the SparkSession.sql function directly.



Answer 2:

In standalone mode, Spark will create the metastore in the directory from where it was launched. This is explained here: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

So you should set spark.sql.warehouse.dir, or simply make sure you always start your Spark job from the same directory (for example, run bin/spark-shell instead of cd bin ; ./spark-shell).
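
Both suggestions can be combined when the shell is launched from a program. The following is only a sketch: FixedLaunchDir and sparkShell are hypothetical names, /opt/spark and /data/spark-warehouse are assumed paths, and the process is built but never started. It relies on the standard --conf key=value option of the Spark launch scripts and on ProcessBuilder.directory to pin the working directory.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

final class FixedLaunchDir {
    // Build (but do not start) a spark-shell launch that always uses the same
    // working directory and an absolute warehouse path.
    static ProcessBuilder sparkShell(File sparkHome, String warehouseDir) {
        List<String> command = Arrays.asList(
                new File(sparkHome, "bin/spark-shell").getPath(),
                "--conf", "spark.sql.warehouse.dir=" + warehouseDir);
        // directory() sets the child's working directory, so with default settings
        // metastore_db and derby.log would be created under sparkHome on every launch.
        return new ProcessBuilder(command).directory(sparkHome);
    }

    public static void main(String[] args) {
        // Hypothetical install and warehouse locations.
        ProcessBuilder pb = sparkShell(new File("/opt/spark"), "/data/spark-warehouse");
        System.out.println(pb.directory() + " $ " + String.join(" ", pb.command()));
        // Calling pb.inheritIO().start() would actually launch the shell.
    }
}
```

Pinning the working directory keeps the Derby metastore in one place, while the absolute warehouse path keeps table data in one place even if a launch happens from somewhere else.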