I have installed Hadoop 2.8.1 on Ubuntu and then installed spark-2.2.0-bin-hadoop2.7 on top of it. I used spark-shell and created tables, and then used beeline and created tables as well. I have observed that three different folders named spark-warehouse were created:
1- spark-2.2.0-bin-hadoop2.7/spark-warehouse
2- spark-2.2.0-bin-hadoop2.7/bin/spark-warehouse
3- spark-2.2.0-bin-hadoop2.7/sbin/spark-warehouse
What exactly is spark-warehouse, and why is it created in multiple places? Sometimes my spark-shell and beeline show different databases and tables, and sometimes they show the same ones. I don't understand what is happening.
Further, I did not install Hive, yet I am still able to use beeline and can also access the databases through a Java program. How did Hive end up on my machine? Please help me. I am new to Spark and installed it by following online tutorials.
Below is the Java code I was using to connect to Apache Spark through JDBC:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection("jdbc:hive2://10.171.0.117:10000/default", "", "");
        Statement stmt = con.createStatement();
    }
}
In standalone mode, Spark creates the metastore in the directory from which it was launched. This is explained here: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

So you should set spark.sql.warehouse.dir, or simply make sure you always start your Spark job from the same directory (run bin/spark instead of cd bin ; ./spark, etc.).

Unless configured otherwise, Spark also creates an internal Derby database named metastore_db, together with a derby.log. It looks like you have not changed that; this is the default behavior, as pointed out in the documentation.
As for why there are several spark-warehouse folders: you are starting those commands from different directories, so what each shell sees is confined to its current working directory.
Hive didn't get installed. You are probably connecting either to the Spark Thrift Server, which is fully compatible with the HiveServer2 protocol, or to the Derby database mentioned above, or you actually do have a HiveServer2 instance running at 10.171.0.117.
Anyway, the JDBC connection is not required here. You can use the SparkSession.sql function directly.
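For instance, instead of the JDBC snippet above, something along these lines runs the same kind of query in-process. This is a sketch; the class name and the table default.sample_table are placeholders for your own objects:

import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        // enableHiveSupport() makes the session use the same metastore
        // (Derby by default) that spark-shell and beeline are using.
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlExample")
                .enableHiveSupport()
                .getOrCreate();

        // Placeholder table name: substitute one of your own tables.
        spark.sql("SELECT * FROM default.sample_table").show();
        spark.stop();
    }
}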