Hive queries via Python client

Posted 2019-07-16 13:21

I have Hive 0.8 installed on a Hadoop cluster running in AWS EMR.

I am trying to do some data QA, which involves running a Hive query and fetching the results into Python, where some more logic is contained.

Currently, I do this by sending a Hive query as a jobflow step, dumping the results to local storage on the master node, SCP-ing them to my local machine, and then loading and parsing the file with Python. All in all, not a very fun process.

Ideally, I would be able to do this in a fashion similar to:

conn = hive.connect(ip, port, user, pw)
cursor = conn.cursor()
cursor.execute(query)
rs = cursor.fetchall()

It seems that this should be possible. The Hive documentation says it is supported, and there is another SO question that appears to do what I'd like to do.

However, I'm having trouble finding documentation. In particular, I haven't been able to figure out where to obtain the packages used in these examples. Detailed instructions for getting the Python client working would be immensely helpful, but failing that, it would help just to know where to obtain these packages.

2 answers
叼着烟拽天下
#2 · 2019-07-16 14:08

It looks like the hive_utils package has what you're looking for. According to its PyPI page, you can run queries in the following way:

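import hive_utils

# `config` is assumed to be a dict holding your Hive server settings.
query = """
    SELECT country, count(1) AS cnt
    FROM User
    GROUP BY country
"""
hive_client = hive_utils.HiveClient(
    server=config['HOST'],
    port=config['PORT'],
    db=config['NAME'],
)
# Each returned row is accessed by column name.
for row in hive_client.execute(query):
    print('%s: %s' % (row['country'], row['cnt']))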

Installing that should also install the needed thrift packages.

#3 · 2019-07-16 14:18

If you build Hive from source, the Python modules will be located here (relative to the hive-trunk directory):

./build/dist/lib/py

You should be able to import the modules if you include that path in your PYTHONPATH environment variable, or if you append it to sys.path in your script.

Also note that there is no longer a module named 'hive'. In the example code you linked 'hive' should be replaced with 'hive_service'.
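
As a rough sketch of the sys.path option, something like the following might work. The HIVE_TRUNK environment variable and the /path/to/hive-trunk default are placeholders I made up for illustration, so substitute the location of your own checkout. The import is guarded because the module is only found once Hive has actually been built there.

```python
import os
import sys

# Hypothetical location of the Hive source checkout; HIVE_TRUNK is an
# assumed variable name, not something Hive itself defines.
HIVE_TRUNK = os.environ.get("HIVE_TRUNK", "/path/to/hive-trunk")

# Directory named in the answer above, relative to hive-trunk.
hive_py_lib = os.path.join(HIVE_TRUNK, "build", "dist", "lib", "py")

# Make the generated Python modules importable from this script.
if hive_py_lib not in sys.path:
    sys.path.append(hive_py_lib)

try:
    # 'hive_service' replaces the old 'hive' module name.
    import hive_service
except ImportError:
    # The build output was not found at the assumed path.
    hive_service = None
```

If hive_service ends up as None, check that the build finished and that the path points at the directory containing the generated modules.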
