Python: save pandas data frame to parquet file

Posted 2020-02-28 02:58

Is it possible to save a pandas data frame directly to a parquet file? If not, what would be the suggested process?

The aim is to be able to send the parquet file to another team, so that they can read/open it with Scala code. Thanks!

6 answers
狗以群分
#2 · 2020-02-28 03:13

Yes, pandas supports saving the DataFrame in Parquet format.

A simple method to write a pandas DataFrame to Parquet:

Assuming df is the pandas DataFrame, we first need to import the following libraries.

import pyarrow as pa
import pyarrow.parquet as pq

First, convert the DataFrame df into a pyarrow Table.

# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)

Second, write the Table to a Parquet file, say file_name.parquet.

# Write the Arrow Table to a Parquet file (Snappy compression by default)
pq.write_table(table, 'file_name.parquet')

NOTE: Parquet files can be compressed while writing. The following are the popular compression formats.

  • Snappy (default, requires no argument)
  • gzip
  • brotli

Parquet with Snappy compression

pq.write_table(table, 'file_name.parquet')

Parquet with GZIP compression

pq.write_table(table, 'file_name.parquet', compression='GZIP')

Parquet with Brotli compression

pq.write_table(table, 'file_name.parquet', compression='BROTLI')

A comparison of the file sizes achieved with the different Parquet compression formats was shown in an image in the original answer (not reproduced here).

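As a rough substitute for that image, here is a minimal sketch (the sample data and file names are illustrative, not from the original answer) that writes the same table once per codec and prints the resulting file sizes:

import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative sample data; substitute your own DataFrame.
df = pd.DataFrame({'col1': range(100_000), 'col2': ['some repeated text'] * 100_000})
table = pa.Table.from_pandas(df)

# Write the same table with each codec and compare the resulting file sizes.
for codec in ['snappy', 'gzip', 'brotli']:
    path = f'file_name_{codec}.parquet'
    pq.write_table(table, path, compression=codec)
    print(f'{codec}: {os.path.getsize(path)} bytes')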

Reference: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/

beautiful°
#3 · 2020-02-28 03:21

There is a relatively early implementation of a package called fastparquet - it could be a good fit for your use case.

https://github.com/dask/fastparquet

conda install -c conda-forge fastparquet

or

pip install fastparquet

from fastparquet import write 
write('outfile.parq', df)

or, if you want to use some file options, like row grouping/compression:

write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000], compression='GZIP', file_scheme='hive')
啃猪蹄的小仙女
#4 · 2020-02-28 03:33

Yes, it is possible. Here is example code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(df, preserve_index=True)
pq.write_table(table, 'output.parquet')
叛逆
#5 · 2020-02-28 03:33

This is the approach that worked for me - similar to the above, but I also chose to specify the compression type:

import pandas as pd 

Set up a test DataFrame:

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

Import the required Parquet library (make sure it has been installed; I used: $ conda install fastparquet):

import fastparquet

Convert the DataFrame to Parquet and save it in the current directory:

df.to_parquet('df.parquet.gzip', compression='gzip')

Read the Parquet file in the current directory back into a pandas DataFrame:

pd.read_parquet('df.parquet.gzip')

Output:

   col1  col2
0     1     3
1     2     4
萌系小妹纸
#6 · 2020-02-28 03:36

pyarrow has support for converting pandas DataFrames:

import pyarrow

table = pyarrow.Table.from_pandas(dataset)  # dataset is an existing pandas DataFrame

The resulting Table can then be written to a Parquet file with pyarrow.parquet.write_table, as shown in the answers above.
一纸荒年 Trace。
#7 · 2020-02-28 03:37

pandas has a core function to_parquet(). Just write the DataFrame to Parquet format like this:

df.to_parquet('myfile.parquet')

You still need to install a Parquet library such as pyarrow or fastparquet. If you have more than one Parquet library installed, you also need to specify which engine you want pandas to use; otherwise, the default engine='auto' tries pyarrow first and falls back to fastparquet (as described in the documentation). For example:

df.to_parquet('myfile.parquet', engine='fastparquet')
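As a quick sanity check before sending the file to the other team, you can read it back with pd.read_parquet (the file name and engine below are just illustrative):

import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

# Write with an explicit engine, then read the file back to verify it is valid.
df.to_parquet('myfile.parquet', engine='pyarrow')
print(pd.read_parquet('myfile.parquet', engine='pyarrow'))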