Is it possible to save a pandas data frame directly to a parquet file? If not, what would be the suggested process?
The aim is to send the parquet file to another team, who can then read/open it with Scala code. Thanks!
Yes, pandas supports saving a dataframe in parquet format.
Here is a simple method to write a pandas dataframe to parquet, assuming df is the pandas dataframe. We need the pyarrow library.
First, write the dataframe df into a pyarrow table.
Second, write the table into a parquet file, say file_name.parquet.
NOTE: parquet files can be further compressed while writing. The following are popular compression formats.
Parquet with Snappy compression
Parquet with GZIP compression
Parquet with Brotli compression
[Figure: comparison achieved with the different parquet compression formats]
Reference: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/
There is a relatively early implementation of a package called fastparquet. It could be a good fit for what you need.
https://github.com/dask/fastparquet
Its write function saves a dataframe to a file. If you want to use some file options, like row grouping/compression, pass them as extra arguments to the same call.
Yes, it is possible. Here is example code:
This is the approach that worked for me. It is similar to the above, but I also chose to stipulate the compression type:
set up test dataframe
import the required parquet library (make sure this has been installed; I used: $ conda install fastparquet)
convert data frame to parquet and save to current directory
read the parquet file in current directory, back into a pandas data frame
output:
pyarrow has support for storing pandas dataframes:
Pandas has a core function, to_parquet(). Just call it on the dataframe with a target path, e.g. df.to_parquet(path), to write the dataframe to parquet format.
You still need to install a parquet library such as fastparquet or pyarrow. If you have more than one parquet library installed, you can specify which one pandas should use through the engine argument, e.g. engine='fastparquet'. Otherwise pandas uses engine='auto', which tries pyarrow first and falls back to fastparquet if pyarrow is unavailable (as described in the documentation).