https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.
When I add this to /etc/profile:
export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py
I can then do the imports as listed in the link, with the exception of from hive import ThriftHive
which actually need to be:
from hive_service import ThriftHive
Next the port in the example was 10000, which when I tried caused the program to hang. The default Hive Thrift port is 9083, which stopped the hanging.
So I set it up like so:
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
transport = TSocket.TSocket('<node-with-metastore>', 9083)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ThriftHive.Client(protocol)
transport.open()
client.execute("CREATE TABLE test(c1 int)")
transport.close()
except Thrift.TException, tx:
print '%s' % (tx.message)
I received the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
self.recv_execute()
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'
But inspecting the ThriftHive.py file reveals the method execute within the Client class.
How may I use Python to access Hive?
To connect using a username/password and specifying ports, the code looks like this:
pyhs2 is no longer maintained. A better alternative is impyla
https://github.com/cloudera/impyla
It has many more features over pyhs2, for example, it has Kerberos authentication, which is a must for us.
Cloudera is putting more effort now on hs2 client https://github.com/cloudera/hs2client which is a C/C++ HiveServer2/Impala client. Might be a better option if you push a lot of data to/from python. (has Python binding too - https://github.com/cloudera/hs2client/tree/master/python )
Some more information on impyla:
Don't be confused that some of the above examples talk about Impala; just change port to 10000 (default) for HiveServer2, and it'll work the same way as with Impala examples. It's the same protocol (Thrift) that is used for both Impala and Hive.
This can be a quick hack to connect hive and python,
Output: List of Tuples
You could use python JayDeBeApi package to create DB-API connection from Hive or Impala JDBC driver and then pass the connection to pandas.read_sql function to return data in pandas dataframe.
similar to @python-starter solution. But, commands package is not avilable on python3.x. So Alternative solution is to use subprocess in python3.x
You can use hive library,for that you want to import hive Class from hive import ThriftHive
Try This example: