I'm currently using PyHive (Python3.6) to read data to a server that exists outside the Hive cluster and then use Python to perform analysis.
After performing analysis I would like to write data back to the Hive server.
In searching for a solution, most posts deal with using PySpark. In the long term we will set up our system to use PySpark. However, in the short term is there a way to easily write data directly to a Hive table using Python from a server outside of the cluster?
Thanks for your help!
You could use the subprocess
module.
The following function will work for data you've already saved locally. For example, if you save a dataframe to csv, you an pass the name of the csv into save_to_hdfs
, and it will throw it in hdfs. I'm sure there's a way to throw the dataframe up directly, but this should get you started.
Here's an example function for saving a local object, output
, to user/<your_name>/<output_name>
in hdfs.
import os
from subprocess import PIPE, Popen
def save_to_hdfs(output):
"""
Save a file in local scope to hdfs.
Note, this performs a forced put - any file with the same name will be
overwritten.
"""
hdfs_path = os.path.join(os.sep, 'user', '<your_name>', output)
put = Popen(["hadoop", "fs", "-put", "-f", output, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()
# example
df = pd.DataFrame(...)
output_file = 'yourdata.csv'
dataframe.to_csv(output_file)
save_to_hdfs(output_file)
# remove locally created file (so it doesn't pollute nodes)
os.remove(output_file)
In which format you want to write data to hive? Parquet/Avro/Binary or simple csv/text format?
Depending upon your choice of serde you use while creating hive table, different python libraries can be used to first convert your dataframe to respective serde, store the file locally and then you can use something like save_to_hdfs (as answered by @Jared Wilber below) to move that file into hdfs hive table location path.
When a hive table is created (default or external table), it reads/stores its data from a specific HDFS location (default or provided location). And this hdfs location can be directly accessed to modify data. Some things to remember if manually updating data in hive tables- SERDE, PARTITIONS, ROW FORMAT DELIMITED etc.
Some helpful serde libraries in python:
- Parquet: https://fastparquet.readthedocs.io/en/latest/
- Avro:https://pypi.org/project/fastavro/
It took some digging but I was able to find a method using sqlalchemy to create a hive table directly from a pandas dataframe.
from sqlalchemy import create_engine
#Input Information
host = 'username@local-host'
port = 10000
schema = 'hive_schema'
table = 'new_table'
#Execution
engine = create_engine(f'hive://{host}:{port}/{schema}')
engine.execute('CREATE TABLE ' + table + ' (col1 col1-type, col2 col2-type)')
Data.to_sql(name=table, con=engine, if_exists='append')
You can write back.
Convert data of df into such format like you are inserting multiple rows into the table at once eg.. insert into table values (first row of dataframe comma separated ), (second row), (third row)
.... so on;
thus you can insert.
bundle=df.assign(col='('+df[df.col[0]] + ','+df[df.col[1]] +...+df[df.col[n]]+')'+',').col.str.cat(' ')[:-1]
con.cursor().execute('insert into table table_name values'+ bundle)
and you are done.