I am new to Spark SQL but familiar with Hive's query execution framework. I would like to understand how Spark executes SQL queries (a technical description).
If I run the commands below:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select count(distinct(id)) from test.emp").collect
In Hive this would be converted into a MapReduce job, but how does it get executed in Spark?
How does the Hive metastore come into the picture?
Thanks in advance.
To answer your question briefly: no, HiveContext will not start a MapReduce job. Your SQL query will still use the Spark engine.
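You can verify this yourself by asking Spark for the query plan. A minimal sketch, assuming the same test.emp table from your question exists in the metastore:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// explain(true) prints the parsed, analyzed, optimized and physical plans.
// The physical plan consists of Spark operators (e.g. Aggregate, Exchange,
// HiveTableScan) rather than MapReduce stages.
sqlContext.sql("select count(distinct(id)) from test.emp").explain(true)

If this were dispatched to Hive's MR engine you would see MapReduce stages instead; with HiveContext you only see Spark's own operators.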
I will quote from the Spark documentation:
In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default Spark build. If these dependencies are not a problem for your application then using HiveContext is recommended for the 1.3 release of Spark. Future releases will focus on bringing SQLContext up to feature parity with a HiveContext
So HiveContext is used by Spark to enhance query parsing and to access existing Hive tables, and even to persist your result DataFrames/tables back to Hive. Separately, Hive itself can use Spark as its execution engine instead of MapReduce or Tez (the "Hive on Spark" project), but that is distinct from what HiveContext does.
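To illustrate that superset, here is a sketch of reading an existing Hive table and persisting a result back into Hive; the test.emp source table and the emp_distinct_ids result table name are assumptions for the example:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Read an existing Hive table, resolved through the metastore.
val distinctIds = sqlContext.sql("select distinct id from test.emp")
// Expose the DataFrame to SQL under a temporary name...
distinctIds.registerTempTable("distinct_ids")
// ...then persist it as a real Hive table using plain HiveQL (CTAS).
sqlContext.sql("create table test.emp_distinct_ids as select * from distinct_ids")

The CTAS statement runs on the Spark engine as well; only the table definition and data location end up managed by Hive.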
The Hive metastore holds metadata about Hive tables (schemas, storage formats, data locations). When you use a HiveContext, Spark connects to this metastore service to resolve table names such as test.emp, then reads the underlying data with its own readers and executes the query on the Spark engine. Refer to the documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html
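Concretely, Spark finds the metastore through a hive-site.xml on its classpath (typically in conf/); without one, HiveContext falls back to a local embedded metastore in the current working directory. A quick sketch to see what the metastore exposes; the test database name is an assumption:

// Lists the tables the metastore knows about in the given database.
sqlContext.tableNames("test").foreach(println)
// Roughly equivalent via HiveQL:
sqlContext.sql("show tables in test").show()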