How to Access Hive via Python?

2019-01-10 00:40发布

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

When I add this to /etc/profile:

export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py

I can then do the imports as listed in the link, with the exception of from hive import ThriftHive which actually need to be:

from hive_service import ThriftHive

Next the port in the example was 10000, which when I tried caused the program to hang. The default Hive Thrift port is 9083, which stopped the hanging.

So I set it up like so:

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
    transport = TSocket.TSocket('<node-with-metastore>', 9083)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE test(c1 int)")

    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)

I received the following error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
self.recv_execute()
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'

But inspecting the ThriftHive.py file reveals the method execute within the Client class.

How may I use Python to access Hive?

14条回答
别忘想泡老子
2楼-- · 2019-01-10 01:07

To connect using a username/password and specifying ports, the code looks like this:

from pyhive import presto

cursor = presto.connect(host='host.example.com',
                    port=8081,
                    username='USERNAME:PASSWORD').cursor()

sql = 'select * from table limit 10'

cursor.execute(sql)

print(cursor.fetchone())
print(cursor.fetchall())
查看更多
Viruses.
3楼-- · 2019-01-10 01:08

pyhs2 is no longer maintained. A better alternative is impyla

https://github.com/cloudera/impyla

It has many more features over pyhs2, for example, it has Kerberos authentication, which is a must for us.

from impala.dbapi import connect
conn = connect(host='my.host.com', port=10000)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print cursor.description  # prints the result set's schema
results = cursor.fetchall()

##
cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    process(row)

Cloudera is putting more effort now on hs2 client https://github.com/cloudera/hs2client which is a C/C++ HiveServer2/Impala client. Might be a better option if you push a lot of data to/from python. (has Python binding too - https://github.com/cloudera/hs2client/tree/master/python )

Some more information on impyla:

Don't be confused that some of the above examples talk about Impala; just change port to 10000 (default) for HiveServer2, and it'll work the same way as with Impala examples. It's the same protocol (Thrift) that is used for both Impala and Hive.

查看更多
太酷不给撩
4楼-- · 2019-01-10 01:08

This can be a quick hack to connect hive and python,

from pyhive import hive
cursor = hive.connect('YOUR_HOST_NAME').cursor()
cursor.execute('SELECT * from table_name LIMIT 5',async=True)
print cursor.fetchall()

Output: List of Tuples

查看更多
SAY GOODBYE
5楼-- · 2019-01-10 01:08

You could use python JayDeBeApi package to create DB-API connection from Hive or Impala JDBC driver and then pass the connection to pandas.read_sql function to return data in pandas dataframe.

import jaydebeapi
# Apparently need to load the jar files for the first time for impala jdbc driver to work 
conn = jaydebeapi.connect('com.cloudera.hive.jdbc41.HS2Driver',
['jdbc:hive2://host:10000/db;AuthMech=1;KrbHostFQDN=xxx.com;KrbServiceName=hive;KrbRealm=xxx.COM', "",""],
jars=['/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/HiveJDBC41.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/TCLIServiceClient.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-codec-1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_metastore.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_service.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/log4j-1.2.14.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/ql.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/zookeeper-3.4.6.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ImpalaJDBC41.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/TCLIServiceClient.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-codec-1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_metastore.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_service.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/log4j-1.2.14.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ql.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/zookeeper-3.4.6.jar'
])

# the previous call have initialized the jar files, technically this call needs not include the required jar files
impala_conn = jaydebeapi.connect('com.cloudera.impala.jdbc41.Driver',
['jdbc:impala://host:21050/db;AuthMech=1;KrbHostFQDN=xxx.com;KrbServiceName=impala;KrbRealm=xxx.COM',"",""],
jars=['/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/HiveJDBC41.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/TCLIServiceClient.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-codec-1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_metastore.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/hive_service.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/log4j-1.2.14.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/ql.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/hive_jdbc_2.5.18.1050/2.5.18.1050 GA/Cloudera_HiveJDBC41_2.5.18.1050/zookeeper-3.4.6.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ImpalaJDBC41.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/TCLIServiceClient.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-codec-1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/commons-logging-1.1.1.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_metastore.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/hive_service.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpclient-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/httpcore-4.1.3.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libfb303-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/libthrift-0.9.0.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/log4j-1.2.14.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/ql.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-api-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/slf4j-log4j12-1.5.11.jar',
'/hadp/opt/jdbc/impala_jdbc_2.5.35/2.5.35.1055 GA/Cloudera_ImpalaJDBC41_2.5.35/zookeeper-3.4.6.jar'
])

import pandas as pd
df1 = pd.read_sql("SELECT * FROM tablename", conn)
df2 = pd.read_sql("SELECT * FROM tablename", impala_conn)

conn.close()
impala_conn.close()
查看更多
做自己的国王
6楼-- · 2019-01-10 01:10

similar to @python-starter solution. But, commands package is not avilable on python3.x. So Alternative solution is to use subprocess in python3.x

import subprocess

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "

status, output = subprocess.getstatusoutput(cmd)

if status == 0:
   print(output)
else:
   print("error")
查看更多
小情绪 Triste *
7楼-- · 2019-01-10 01:13

You can use hive library,for that you want to import hive Class from hive import ThriftHive

Try This example:

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
  transport = TSocket.TSocket('localhost', 10000)
  transport = TTransport.TBufferedTransport(transport)
  protocol = TBinaryProtocol.TBinaryProtocol(transport)
  client = ThriftHive.Client(protocol)
  transport.open()
  client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
  client.execute("LOAD TABLE LOCAL INPATH '/path' INTO TABLE r")
  client.execute("SELECT * FROM r")
  while (1):
    row = client.fetchOne()
    if (row == None):
       break
    print row

  client.execute("SELECT * FROM r")
  print client.fetchAll()
  transport.close()
except Thrift.TException, tx:
  print '%s' % (tx.message)
查看更多
登录 后发表回答