I have some data in HDFS,i need to access that data using python,can anyone tell me how data is accessed from hive using python?
问题:
回答1:
You can use hive library for access hive from python,for that you want to import hive Class from hive import ThriftHive
Below the Example
import sys
from hive import ThriftHive
from hive.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
transport = TSocket.TSocket('localhost', 10000)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ThriftHive.Client(protocol)
transport.open()
client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
client.execute("LOAD TABLE LOCAL INPATH '/path' INTO TABLE r")
client.execute("SELECT * FROM r")
while (1):
row = client.fetchOne()
if (row == None):
break
print row
client.execute("SELECT * FROM r")
print client.fetchAll()
transport.close()
except Thrift.TException, tx:
print '%s' % (tx.message)
回答2:
To install you'll need these libraries:
pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive
If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev
using apt-get
or yum
or whatever package manager. For Windows there are some options on GNU.org. On a Mac SASL should be available if you've installed xcode developer tools (xcode-select --install
)
After installation, you can execute a hive query like this:
from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")
Now that you have the hive connection, you have options how to use it. You can just straight-up query:
cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
use_result(result)
...or to use the connection to make a Pandas dataframe:
import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)
回答3:
A much simpler solution if you're on Windows uses pyodbc
:
import pyodbc
import pandas as pd
# connect odbc to data source name
conn = pyodbc.connect("DSN=<your_dsn>", autocommit=True)
# read data into dataframe
hive_df = pd.read_sql("SELECT * FROM <table_name>", conn)
As long as you have an ODBC driver and a DSN, that's all you need.