I am looking for ways to read data from multiple partitioned directories in S3 using Python.
data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
data_folder/serial_number=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet
pyarrow's ParquetDataset module has the capability to read from partitions, so I tried the following code:
>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> import s3fs
>>> a = "s3://my_bucker/path/to/data_folder/"
>>> dataset = pq.ParquetDataset(a)
It threw the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
self.metadata_path) = _make_manifest(path_or_paths, self.fs)
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 601, in _make_manifest
.format(path))
OSError: Passed non-file path: s3://my_bucker/path/to/data_folder/
Based on pyarrow's documentation, I tried using s3fs as the file system, i.e.:
>>> dataset = pq.ParquetDataset(a,filesystem=s3fs)
Which throws the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
self.metadata_path) = _make_manifest(path_or_paths, self.fs)
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in _make_manifest
if is_string(path_or_paths) and fs.isdir(path_or_paths):
AttributeError: module 's3fs' has no attribute 'isdir'
I am limited to using an ECS cluster, hence spark/pyspark is not an option.
Is there a way to easily read the parquet files from such partitioned directories in S3 with Python? I feel that listing all the directories and then reading them is not good practice, as suggested in this link. I need to convert the read data to a pandas DataFrame for further processing and hence prefer options based on fastparquet or pyarrow, but I am open to other Python options as well.
Let's discuss in https://issues.apache.org/jira/browse/ARROW-1213 and https://issues.apache.org/jira/browse/ARROW-1119. We must add some code to allow pyarrow to recognize the s3fs filesystem and add a shim / compatibility class to conform S3FS's slightly different filesystem API to pyarrow's.
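For illustration only, a rough sketch (not pyarrow's actual implementation) of what such a compatibility wrapper could look like: it simply delegates the calls pyarrow's manifest builder makes (such as the isdir from the traceback above) to an s3fs.S3FileSystem instance.

import s3fs

class S3FSCompat:
    # Hypothetical shim: exposes methods pyarrow's legacy filesystem API expects
    # (isdir, isfile, ls, open, walk) by delegating to s3fs.S3FileSystem.
    def __init__(self, fs=None):
        self.fs = fs or s3fs.S3FileSystem()

    def isdir(self, path):
        return self.fs.isdir(path)

    def isfile(self, path):
        return self.fs.isfile(path)

    def ls(self, path):
        return self.fs.ls(path)

    def open(self, path, mode="rb"):
        return self.fs.open(path, mode)

    def walk(self, path):
        return self.fs.walk(path)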
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
Example:
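A minimal sketch with awswrangler (the bucket path is a placeholder, and the exact API depends on the awswrangler version you install):

import awswrangler as wr

# dataset=True treats the prefix as a partitioned dataset, so the partition
# columns (serial_number, cur_date) are added to the resulting DataFrame
df = wr.s3.read_parquet(
    path="s3://my_bucket/path/to/data_folder/",
    dataset=True,
)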
I managed to get this working with the latest release of fastparquet & s3fs. Below is the code for the same:
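A sketch of that approach (the bucket name and glob pattern are placeholders; adjust the number of * levels to your partition depth):

import s3fs
import fastparquet as fp

s3 = s3fs.S3FileSystem()

# glob the leaf parquet files across both partition levels
# (serial_number=*/cur_date=*)
all_paths_from_s3 = s3.glob("my_bucket/path/to/data_folder/*/*/*.parquet")

# hand fastparquet the file list and let it open each file through s3fs;
# the partition columns are inferred from the directory names
fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=s3.open)
df = fp_obj.to_pandas()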
Credit to Martin for pointing me in the right direction via our conversation.
NB: This would be slower than using pyarrow, based on the benchmark. I will update my answer once s3fs support is implemented in pyarrow via ARROW-1213.
I did a quick benchmark on individual iterations with pyarrow, and with a list of files sent as a glob to fastparquet. fastparquet with s3fs is faster than pyarrow plus my hackish code, but I reckon pyarrow + s3fs will be faster once implemented.
The code & benchmarks are below:
Update 2019
After all the PRs, issues such as ARROW-2038 & fastparquet PR#182 have been resolved.
Read parquet files using Pyarrow
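A sketch with pyarrow + s3fs (bucket and path are placeholders):

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()
bucket_uri = "s3://my_bucket/path/to/data_folder"   # no trailing slash

# ParquetDataset discovers the hive-style partitions
# (serial_number=.../cur_date=...) under the prefix
dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
table = dataset.read()
df = table.to_pandas()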
Read parquet files using Fast parquet
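With fastparquet, the same glob + open_with pattern sketched earlier works; condensed (placeholders as above):

import s3fs
import fastparquet as fp

s3 = s3fs.S3FileSystem()
paths = s3.glob("my_bucket/path/to/data_folder/*/*/*.parquet")
df = fp.ParquetFile(paths, open_with=s3.open).to_pandas()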
Quick benchmarks
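A rough timing sketch (placeholders as above; absolute numbers depend entirely on your data, region and network, so none are quoted here):

import timeit

import s3fs
import fastparquet as fp
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()
bucket_uri = "s3://my_bucket/path/to/data_folder"

def read_pyarrow():
    return pq.ParquetDataset(bucket_uri, filesystem=fs).read().to_pandas()

def read_fastparquet():
    paths = fs.glob("my_bucket/path/to/data_folder/*/*/*.parquet")
    return fp.ParquetFile(paths, open_with=fs.open).to_pandas()

print("pyarrow     :", timeit.timeit(read_pyarrow, number=5))
print("fastparquet :", timeit.timeit(read_fastparquet, number=5))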
This is probably not the best way to benchmark it; please read the blog post for a thorough benchmark.
Further reading regarding Pyarrow's speed
For those of you who want to read in only parts of a partitioned parquet file, pyarrow accepts a list of keys as well as just the partial directory path to read in all parts of the partition. This method is especially useful for organizations that have partitioned their parquet datasets in a meaningful way, for example by year or country, allowing users to specify which parts of the file they need. This will reduce costs in the long run, as AWS charges per byte when reading in datasets.
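A sketch of both variants (bucket name and file names are placeholders, based on the layout in the question):

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# 1) partial directory path: read every file under one partition value
dataset = pq.ParquetDataset(
    "s3://my_bucket/path/to/data_folder/serial_number=1",
    filesystem=fs,
)
df = dataset.read().to_pandas()

# 2) explicit list of keys: read only the files you name
keys = [
    "s3://my_bucket/path/to/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet",
    "s3://my_bucket/path/to/data_folder/serial_number=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet",
]
df = pq.ParquetDataset(keys, filesystem=fs).read().to_pandas()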
This issue was resolved in this pull request in 2017.
For those who want to read parquet from S3 using only pyarrow, here is an example:
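A sketch using pyarrow's built-in S3 support (pyarrow.fs.S3FileSystem, available in recent releases); region, bucket and path are placeholders:

import pyarrow.parquet as pq
from pyarrow import fs

# credentials are picked up from the usual AWS sources (env vars, config files, IAM role)
s3 = fs.S3FileSystem(region="us-east-1")

# note: pyarrow's native filesystem takes "bucket/key" paths without the s3:// scheme
table = pq.read_table("my_bucket/path/to/data_folder", filesystem=s3)
df = table.to_pandas()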