I am trying to read parquet files using the dask read_parquet method and the filters kwarg. However, it sometimes doesn't filter according to the given condition.
Example: creating and saving a data frame with a dates column:
import pandas as pd
import numpy as np
import dask.dataframe as dd
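# five rows: one date per day starting 2018-07-01, numbered 1..5
nums = range(1, 6)
dates = pd.date_range('2018-07-01', periods=5, freq='1d')
df = pd.DataFrame({'dates': dates, 'nums': nums})
# to_parquet returns None by default, so keep the dask frame and the write as separate steps
ddf = dd.from_pandas(df, npartitions=3)
ddf.to_parquet('test_par', engine='fastparquet')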
When I read the 'test_par' folder and filter on the dates column, the filter doesn't seem to work:
filters=[('dates', '>', np.datetime64('2018-07-04'))]
df = dd.read_parquet('test_par', engine='fastparquet', filters=filters).compute()
As you can see in the output, 2018-07-03 and 2018-07-04 are present, even though the filter asks for dates strictly after 2018-07-04:
+-------+------------+------+
| | dates | nums |
+-------+------------+------+
| index | | |
+-------+------------+------+
| 2 | 2018-07-03 | 3 |
+-------+------------+------+
| 3 | 2018-07-04 | 4 |
+-------+------------+------+
| 4 | 2018-07-05 | 5 |
+-------+------------+------+
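My current guess, which may well be wrong, is that with the fastparquet engine the filters kwarg only skips whole row groups (using their min/max statistics) and does not drop individual rows inside a row group that is kept. Below is a small pure-Python sketch of that idea, with no dask involved. The row-group layout, the helper names prune_row_groups and filter_rows, and the pruning rule are all assumptions I made up for illustration; they are not taken from dask or fastparquet.

```python
from datetime import date

# Hypothetical row-group layout (an assumption; the real split on disk may differ).
# Each row is (index, date), mirroring the 'dates' column from the example.
row_groups = [
    [(0, date(2018, 7, 1)), (1, date(2018, 7, 2))],
    [(2, date(2018, 7, 3)), (3, date(2018, 7, 4)), (4, date(2018, 7, 5))],
]
threshold = date(2018, 7, 4)


def prune_row_groups(groups, value):
    """Coarse filter: keep a whole row group if its max date is > value."""
    return [g for g in groups if max(d for _, d in g) > value]


def filter_rows(groups, value):
    """Exact filter: keep only the individual rows whose date is > value."""
    return [(i, d) for g in groups for (i, d) in g if d > value]


kept = prune_row_groups(row_groups, threshold)
coarse = [row for g in kept for row in g]   # what a row-group-only filter returns
exact = filter_rows(kept, threshold)        # what I actually expected to get

print([i for i, _ in coarse])  # [2, 3, 4]
print([i for i, _ in exact])   # [4]
```

If that guess is right, the coarse result matches what I am seeing, and I would presumably have to apply the condition again as an ordinary boolean mask after read_parquet (something like ddf[ddf.dates > np.datetime64('2018-07-04')]) to get row-level filtering.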
Am I doing something wrong, or should I report this on GitHub?