I am using CDH 5.5.
I have a table created in the Hive default database and am able to query it from the Hive CLI.
Output
hive> use default;
OK
Time taken: 0.582 seconds
hive> show tables;
OK
bank
Time taken: 0.341 seconds, Fetched: 1 row(s)
hive> select count(*) from bank;
OK
542
Time taken: 64.961 seconds, Fetched: 1 row(s)
However, I am unable to query the table from pyspark, as it does not recognize the table.
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("use default")
DataFrame[result: string]
sqlContext.sql("show tables").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+
sqlContext.sql("FROM bank SELECT count(*)")
16/03/16 20:12:13 INFO parse.ParseDriver: Parsing command: FROM bank SELECT count(*)
16/03/16 20:12:13 INFO parse.ParseDriver: Parse Completed
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 552, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 40, in deco
raise AnalysisException(s.split(': ', 1)[1])
pyspark.sql.utils.AnalysisException: no such table bank; line 1 pos 5
New Error
>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> bank = hive_context.table("default.bank")
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:50 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table
return DataFrame(self._ssql_ctx.table(tableName), self)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.table.
: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Thanks.
We cannot pass the Hive table name directly to the HiveContext sql method, since it does not recognize the table name. One way to read a Hive table in the pyspark shell is shown below.
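A minimal sketch, reusing the default.bank table from the question and the sc SparkContext already available in the pyspark shell:

from pyspark.sql import HiveContext

hive_context = HiveContext(sc)
# Load the Hive table into a DataFrame
bank = hive_context.table("default.bank")
bank.show()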
To run SQL against the Hive table, first register the DataFrame obtained from reading the table as a temporary table; then the SQL query can be run against it, as in the sketch below.
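Continuing from the DataFrame above (the temporary table name bank_temp is only illustrative):

# Register the DataFrame so it can be referenced from SQL
bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()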
In my case, copying hive-site.xml to $SPARK_HOME/conf and copying mysql-connector-java-*.jar to $SPARK_HOME/jars solved the problem.
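A quick way to verify the fix (a sketch; it assumes the pyspark shell is restarted after copying the files):

from pyspark.sql import HiveContext

# With hive-site.xml in place, a fresh HiveContext should see the Hive metastore
sqlContext = HiveContext(sc)
sqlContext.sql("show tables").show()  # the bank table from the question should now be listed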
Spark SQL ships with its own metastore (Derby), so it can work even if Hive is not installed on the system; this is the default mode. In the question above, you created the table in Hive, so you get the table not found error because Spark SQL is using its default metastore, which has no metadata for your Hive table. If you want Spark SQL to use the Hive metastore instead and access Hive tables, you have to add hive-site.xml to the Spark conf folder.

Once hive-site.xml has been copied to the Spark conf path, you can use sqlCtx.sql directly:
my_dataframe = sqlCtx.sql("Select * from categories")
my_dataframe.show()
Not sure if this has been resolved yet. I was checking out the PySpark kernel with Livy integration, and this is how I tested the Hive configuration:
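A sketch of such a check (the sample data and the people_test name are only illustrative):

from pyspark.sql import Row, HiveContext

hive_context = HiveContext(sc)

# Build a tiny DataFrame and register it as a temporary table
rows = sc.parallelize([('A', 25), ('B', 20)]).map(lambda x: Row(name=x[0], age=x[1]))
people = hive_context.createDataFrame(rows)
people.registerTempTable("people_test")

# If hive-site.xml is picked up, the Hive tables (e.g. bank) appear alongside the temp table
hive_context.sql("show tables").show()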