I'm currently using PyHive (Python 3.6) to read data from a Hive cluster into a server that sits outside the cluster, and then using Python to perform the analysis.
After performing the analysis I would like to write data back to the Hive server. In searching for a solution, most posts deal with using PySpark. In the long term we will set up our system to use PySpark; however, in the short term, is there a way to easily write data directly to a Hive table using Python from a server outside the cluster?
Thanks for your help!
You can write back. Convert the data in your dataframe into a single statement that inserts multiple rows into the table at once, e.g.
insert into table_name values (first row of dataframe, comma separated), (second row), (third row), ... ;
Run that through your existing connection and you are done.
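A minimal sketch of that idea over a PyHive connection (the host, port, and table name are placeholders, and the quoting helper below is simplistic rather than a real escaping solution):

from pyhive import hive
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'label': ['a', 'b', 'c']})

# Host, port, and table name are placeholders for your environment.
conn = hive.connect(host='hive-server-host', port=10000, username='your_user')
cursor = conn.cursor()

def hql_literal(v):
    # Quote strings for HiveQL; rely on str() for numeric values.
    if isinstance(v, str):
        return "'{}'".format(v.replace("'", "\\'"))
    return str(v)

# Render the whole dataframe as one multi-row INSERT statement.
rows = ', '.join(
    '({})'.format(', '.join(hql_literal(v) for v in row))
    for row in df.itertuples(index=False)
)
cursor.execute('INSERT INTO your_table VALUES {}'.format(rows))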
You could use the subprocess module. The following approach works for data you've already saved locally. For example, if you save a dataframe to CSV, you can pass the name of the CSV file into save_to_hdfs, and it will put it in HDFS. I'm sure there's a way to upload the dataframe directly, but this should get you started.
Here's an example function for saving a local object, output, to user/<your_name>/<output_name> in HDFS.
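A minimal sketch of what that function might look like, assuming the hdfs command-line client is installed and configured on the machine (the directory and file names are placeholders):

import subprocess

def save_to_hdfs(local_path, hdfs_dir='/user/your_name'):
    # Copy a locally saved file (e.g. a CSV) into HDFS.
    # Assumes the `hdfs` CLI is available; -f overwrites an existing file.
    subprocess.check_call(['hdfs', 'dfs', '-put', '-f', local_path, hdfs_dir])

# Usage: save the dataframe locally first, then push it up.
# df.to_csv('output.csv', index=False)
# save_to_hdfs('output.csv')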
In which format do you want to write the data to Hive? Parquet/Avro/binary, or plain CSV/text? Depending on the SerDe you chose when creating the Hive table, different Python libraries can be used to first convert your dataframe to the respective format, store the file locally, and then move that file into the Hive table's HDFS location path with something like save_to_hdfs (as answered by @Jared Wilber below).
When a Hive table is created (managed or external), it reads/stores its data at a specific HDFS location (the default location or the one provided), and that HDFS location can be accessed directly to modify the data. Some things to keep in mind when manually updating data in Hive tables: the SerDe, partitions, ROW FORMAT DELIMITED, etc.
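For example, a hedged sketch of pointing an external table at the HDFS directory used above, so that any file you put there becomes queryable (the table name, columns, delimiter, and path are placeholders and must match how your files are actually written):

from pyhive import hive

conn = hive.connect(host='hive-server-host', port=10000, username='your_user')
cursor = conn.cursor()

# ROW FORMAT DELIMITED here must match the layout of the uploaded files.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS your_table (
        id INT,
        label STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/your_name/output'
""")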
Some helpful SerDe-related libraries in Python: pyarrow and fastparquet for Parquet, and fastavro for Avro.
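For instance, a small sketch of the Parquet route with pyarrow (the file name is a placeholder; the resulting file would then be moved into the table's HDFS location):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'id': [1, 2], 'label': ['a', 'b']})

# Convert the dataframe to an Arrow table and write it out as Parquet;
# the file can then be pushed to the Hive table's HDFS location.
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')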
It took some digging, but I was able to find a method using SQLAlchemy to create a Hive table directly from a pandas dataframe.
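A minimal sketch of that approach, assuming PyHive's SQLAlchemy dialect is installed (e.g. pip install 'pyhive[hive]' sqlalchemy); the host, port, database, and table name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# PyHive registers the 'hive' dialect with SQLAlchemy.
engine = create_engine('hive://your_user@hive-server-host:10000/default')

df = pd.DataFrame({'id': [1, 2], 'label': ['a', 'b']})

# Creates the table if it doesn't exist and appends the rows.
df.to_sql('your_table', engine, if_exists='append', index=False)

Note that to_sql issues INSERT statements through HiveServer2, which is fine for small results but can be slow for large dataframes.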