I'm currently using PyHive (Python 3.6) to read data from a Hive cluster into a server that sits outside the cluster, and then using Python to perform the analysis.
After performing the analysis I would like to write data back to the Hive server. In searching for a solution, most posts deal with using PySpark. In the long term we will set up our system to use PySpark; however, in the short term, is there a way to easily write data directly to a Hive table using Python from a server outside the cluster?
Thanks for your help!
You can write back. Convert the data in your dataframe into a single statement that inserts multiple rows into the table at once, e.g.
insert into table_name values (first row of dataframe, comma separated), (second row), (third row), ... ;
Run that through your existing connection and you are done.
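A minimal sketch of that idea over a PyHive connection (the host, port, and table name are placeholders, and the quoting helper below is simplistic rather than a real escaping solution):

from pyhive import hive
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'label': ['a', 'b', 'c']})

# Host, port, and table name are placeholders for your environment.
conn = hive.connect(host='hive-server-host', port=10000, username='your_user')
cursor = conn.cursor()

def hql_literal(v):
    # Quote strings for HiveQL; rely on str() for numeric values.
    if isinstance(v, str):
        return "'{}'".format(v.replace("'", "\\'"))
    return str(v)

# Render the whole dataframe as one multi-row INSERT statement.
rows = ', '.join(
    '({})'.format(', '.join(hql_literal(v) for v in row))
    for row in df.itertuples(index=False)
)
cursor.execute('INSERT INTO your_table VALUES {}'.format(rows))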
You could use the subprocess module. The following approach works for data you've already saved locally. For example, if you save a dataframe to CSV, you can pass the name of the CSV file into save_to_hdfs, and it will put it in HDFS. I'm sure there's a way to upload the dataframe directly, but this should get you started.
Here's an example function for saving a local object, output, to user/<your_name>/<output_name> in HDFS.
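A minimal sketch of what that function might look like, assuming the hdfs command-line client is installed and configured on the machine (the directory and file names are placeholders):

import subprocess

def save_to_hdfs(local_path, hdfs_dir='/user/your_name'):
    # Copy a locally saved file (e.g. a CSV) into HDFS.
    # Assumes the `hdfs` CLI is available; -f overwrites an existing file.
    subprocess.check_call(['hdfs', 'dfs', '-put', '-f', local_path, hdfs_dir])

# Usage: save the dataframe locally first, then push it up.
# df.to_csv('output.csv', index=False)
# save_to_hdfs('output.csv')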
In which format do you want to write the data to Hive? Parquet/Avro/binary, or plain CSV/text? Depending on the SerDe you chose when creating the Hive table, different Python libraries can be used to first convert your dataframe to the respective format, store the file locally, and then move that file into the Hive table's HDFS location path with something like save_to_hdfs (as answered by @Jared Wilber below).
When a Hive table is created (managed or external), it reads/stores its data at a specific HDFS location (the default location or the one provided), and that HDFS location can be accessed directly to modify the data. Some things to keep in mind when manually updating data in Hive tables: the SerDe, partitions, ROW FORMAT DELIMITED, etc.
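For example, a hedged sketch of pointing an external table at the HDFS directory used above, so that any file you put there becomes queryable (the table name, columns, delimiter, and path are placeholders and must match how your files are actually written):

from pyhive import hive

conn = hive.connect(host='hive-server-host', port=10000, username='your_user')
cursor = conn.cursor()

# ROW FORMAT DELIMITED here must match the layout of the uploaded files.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS your_table (
        id INT,
        label STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/your_name/output'
""")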
Some helpful SerDe-related libraries in Python: pyarrow and fastparquet for Parquet, and fastavro for Avro.
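For instance, a small sketch of the Parquet route with pyarrow (the file name is a placeholder; the resulting file would then be moved into the table's HDFS location):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'id': [1, 2], 'label': ['a', 'b']})

# Convert the dataframe to an Arrow table and write it out as Parquet;
# the file can then be pushed to the Hive table's HDFS location.
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')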
It took some digging, but I was able to find a method using SQLAlchemy to create a Hive table directly from a pandas dataframe.
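A minimal sketch of that approach, assuming PyHive's SQLAlchemy dialect is installed (e.g. pip install 'pyhive[hive]' sqlalchemy); the host, port, database, and table name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# PyHive registers the 'hive' dialect with SQLAlchemy.
engine = create_engine('hive://your_user@hive-server-host:10000/default')

df = pd.DataFrame({'id': [1, 2], 'label': ['a', 'b']})

# Creates the table if it doesn't exist and appends the rows.
df.to_sql('your_table', engine, if_exists='append', index=False)

Note that to_sql issues INSERT statements through HiveServer2, which is fine for small results but can be slow for large dataframes.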