How to Access Hive via Python?

2019-01-10 00:40发布

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

When I add this to /etc/profile:

export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py

I can then do the imports as listed in the link, with the exception of from hive import ThriftHive which actually need to be:

from hive_service import ThriftHive

Next the port in the example was 10000, which when I tried caused the program to hang. The default Hive Thrift port is 9083, which stopped the hanging.

So I set it up like so:

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
    transport = TSocket.TSocket('<node-with-metastore>', 9083)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE test(c1 int)")

    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)

I received the following error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
self.recv_execute()
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'

But inspecting the ThriftHive.py file reveals the method execute within the Client class.

How may I use Python to access Hive?

14条回答
仙女界的扛把子
2楼-- · 2019-01-10 01:14

The examples above are a bit out of date. One new example is here:

import pyhs2 as hive
import getpass
DEFAULT_DB = 'default'
DEFAULT_SERVER = '10.37.40.1'
DEFAULT_PORT = 10000
DEFAULT_DOMAIN = 'PAM01-PRD01.IBM.COM'

u = raw_input('Enter PAM username: ')
s = getpass.getpass()
connection = hive.connect(host=DEFAULT_SERVER, port= DEFAULT_PORT, authMechanism='LDAP', user=u + '@' + DEFAULT_DOMAIN, password=s)
statement = "select * from user_yuti.Temp_CredCard where pir_post_dt = '2014-05-01' limit 100"
cur = connection.cursor()

cur.execute(statement)
df = cur.fetchall() 

In addition to the standard python program, a few libraries need to be installed to allow Python to build the connection to the Hadoop databae.

1.Pyhs2, Python Hive Server 2 Client Driver

2.Sasl, Cyrus-SASL bindings for Python

3.Thrift, Python bindings for the Apache Thrift RPC system

4.PyHive, Python interface to Hive

Remember to change the permission of the executable

chmod +x test_hive2.py ./test_hive2.py

Wish it helps you. Reference: https://sites.google.com/site/tingyusz/home/blogs/hiveinpython

查看更多
戒情不戒烟
3楼-- · 2019-01-10 01:18

I have solved the same problem with you,here is my operation environment( System:linux Versions:python 3.6 Package:Pyhive) please refer to my answer as follows:

from pyhive import hive
conn = hive.Connection(host='149.129.***.**', port=10000, username='*', database='*',password="*",auth='LDAP')

The key point is to add the reference password & auth and meanwhile set the auth equal to 'LDAP' . Then it works well, any questions please let me know

查看更多
对你真心纯属浪费
4楼-- · 2019-01-10 01:22

I believe the easiest way is to use PyHive.

To install you'll need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case.

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager for your distribution. For Windows there are some options on GNU.org, you can download a binary installer. On a Mac SASL should be available if you've installed xcode developer tools (xcode-select --install in Terminal)

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have the hive connection, you have options how to use it. You can just straight-up query:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
  use_result(result)

...or to use the connection to make a Pandas dataframe:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)
查看更多
我命由我不由天
5楼-- · 2019-01-10 01:22

It is a common practice to prohibit user to download and install packages and libraries on cluster nodes. In this case solutions of @python-starter and @goks are working perfect, if hive run on the same node. Otherwise, one can use a beeline instead of hive command line tool. See details

#python 2
import commands

cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

status, output = commands.getstatusoutput(cmd)

if status == 0:
   print output
else:
   print "error"

.

#python 3
import subprocess

cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

status, output = subprocess.getstatusoutput(cmd)

if status == 0:
   print(output)
else:
   print("error")
查看更多
家丑人穷心不美
6楼-- · 2019-01-10 01:22

By using Python Client Driver

pip install pyhs2

Then

import pyhs2

with pyhs2.connect(host='localhost',
               port=10000,
               authMechanism="PLAIN",
               user='root',
               password='test',
               database='default') as conn:
with conn.cursor() as cur:
    #Show databases
    print cur.getDatabases()

    #Execute query
    cur.execute("select * from table")

    #Return column info from query
    print cur.getSchema()

    #Fetch table results
    for i in cur.fetch():
        print i

Refer : https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

查看更多
走好不送
7楼-- · 2019-01-10 01:23

Similar to eycheu's solution, but a little more detailed.

Here is an alternative solution specifically for hive2 that does not require PyHive or installing system-wide packages. I am working on a linux environment that I do not have root access to so installing the SASL dependencies as mentioned in Tristin's post was not an option for me:

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager for your distribution.

Specifically, this solution focuses on leveraging the python package: JayDeBeApi. In my experience installing this one extra package on top of a python Anaconda 2.7 install was all I needed. This package leverages java (JDK). I am assuming that is already set up.

Step 1: Install JayDeBeApi

pip install jaydebeap

Step 2: Download appropriate drivers for your environment:

  • Here is a link to the jars required for an enterprise CDH environment
  • Another post that talks about where to find jdbc drivers for Apache Hive

Store all .jar files in a directory. I will refer to this directory as /path/to/jar/files/.

Step 3: Identify your systems authentication mechanism:

In the pyhive solutions listed I've seen PLAIN listed as the authentication mechanism as well as Kerberos. Note that your jdbc connection URL will depend on the authentication mechanism you are using. I will explain Kerberos solution without passing a username/password. Here is more information Kerberos authentication and options.

Create a Kerberos ticket if one is not already created

$ kinit

Tickets can be viewed via klist.

You are now ready to make the connection via python:

import jaydebeapi
import glob
# Creates a list of jar files in the /path/to/jar/files/ directory
jar_files = glob.glob('/path/to/jar/files/*.jar')

host='localhost'
port='10000'
database='default'

# note: your driver will depend on your environment and drivers you've
# downloaded in step 2
# this is the driver for my environment (jdbc3, hive2, cloudera enterprise)
driver='com.cloudera.hive.jdbc3.HS2Driver'

conn_hive = jaydebeapi.connect(driver,
        'jdbc:hive2://'+host+':' +port+'/'+database+';AuthMech=1;KrbHostFQDN='+host+';KrbServiceName=hive'
                           ,jars=jar_files)

If you only care about reading, then you can read it directly into a panda's dataframe with ease via eycheu's solution:

import pandas as pd
df = pd.read_sql("select * from table", conn_hive)

Otherwise, here is a more versatile communication option:

cursor = conn_hive.cursor()
sql_expression = "select * from table"
cursor.execute(sql_expression)
results = cursor.fetchall()

You could imagine, if you wanted to create a table, you would not need to "fetch" the results, but could submit a create table query instead.

查看更多
登录 后发表回答