Problem: I have about 5000 existing netCDF4 files, each typically of shape 96x3712x3712 (float32 data points). The first dimension is time (one file per day); the second and third are spatial dimensions. Currently, making a slice over the first dimension (even a partial slice) takes a lot of time, for the following reasons:
- the netCDF files are chunked with a chunk size of 1x3712x3712, so slicing over the time dimension basically means reading the entire file.
- looping over all of the smaller files (even in multiple processes) takes a lot of time as well.
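For reference, the chunk layout of a daily file can be inspected with netCDF4-python; a minimal sketch (the file path and variable name are placeholders):

from netCDF4 import Dataset

nc = Dataset('/data/daily/20050101.nc')   # placeholder path to one daily file
var = nc.variables['variable']            # placeholder variable name
print(var.shape)                          # e.g. (96, 3712, 3712)
print(var.chunking())                     # e.g. [1, 3712, 3712], or 'contiguous'
nc.close()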
My goal:
- create monthly files (about 2900x3712x3712 data points)
- optimize them for slicing in the time dimension (chunk size of 2900x1x1, or slightly bigger in the spatial dimensions)
Other requirements:
- the files should be appendable with a single timestamp (1x3712x3712), and this update process should take less than 15 min
- the query should be fast enough: a full slice over time in less than one second (that is only 2900x1x1) ==> not that much data, in fact...
- preferably, the files should be readable by multiple processes while being updated
- processing the historical data (the other 5000 daily files) should preferably take less than a couple of weeks.
I have already tried multiple approaches:
- concatenating the netCDF files and rechunking them ==> takes too much memory and too much time...
- writing them from pandas to an HDF file (using pytables) ==> creates a wide table with a huge index. This eventually takes too much time to read as well, and requires the dataset to be tiled over the spatial dimensions due to metadata constraints.
- my latest approach was writing them to an HDF5 file using h5py:
Here's the code to create a single monthly file:
import os
import logging

import h5py
import numpy as np
import pandas as pd
from netCDF4 import Dataset, num2date

logger = logging.getLogger(__name__)

def create_h5(fps):
    # Reference time axis: one month of 15-minute timestamps (31 days * 96 slots per day)
    timestamps = pd.date_range("20050101", periods=31*96, freq='15T')
    output_fp = r'/data/test.h5'
    try:
        f = h5py.File(output_fp, 'a', libver='latest')
        # full monthly shape: 96 timesteps per daily file
        shape = (96*len(fps), 3712, 3712)
        d = f.create_dataset('variable', shape=(1, 3712, 3712),
                             maxshape=(None, 3712, 3712), dtype='f',
                             compression='gzip', compression_opts=9,
                             chunks=(1, 29, 29))
        f.swmr_mode = True
        for fp in fps:
            try:
                nc = Dataset(fp)
                times = num2date(nc.variables['time'][:], nc.variables['time'].units)
                # position of each timestamp of this daily file on the monthly time axis
                indices = np.searchsorted(timestamps, times)
                for j, time in enumerate(times):
                    logger.debug("File: {}, timestamp: {:%Y%m%d %H:%M}, pos: {}, new_pos: {}".format(
                        os.path.basename(fp), time, j, indices[j]))
                    # grow the time dimension as needed, then write a single timestep
                    d.resize((indices[j]+1, shape[1], shape[2]))
                    d[indices[j]] = nc.variables['variable'][j:j+1]
                    f.flush()
            finally:
                nc.close()
    finally:
        f.close()
    return output_fp
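For completeness, this is roughly how the function is called (a sketch; the glob pattern is a placeholder for one month of daily files):

import glob

fps = sorted(glob.glob('/data/daily/200501*.nc'))  # placeholder pattern
create_h5(fps)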
I'm using the latest version of HDF5 to have the SWMR option available. The fps argument is a list of file paths of the daily netCDF4 files. It creates the file (on an SSD, but I see that creating the file is mainly CPU-bound) in about 2 hours, which is acceptable.
I have compression set up to keep the file size within limits. I did earlier tests without it and saw that creation is a bit faster without compression, but slicing doesn't take much longer with it. H5py automatically chunks the dataset in 1x116x116 chunks.
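As a sanity check, the chunk shape and filters that actually ended up in the monthly file can be inspected with h5py (a minimal sketch, using the path and dataset name from the code above):

import h5py

with h5py.File('/data/test.h5', 'r') as f:
    d = f['variable']
    print(d.shape)                            # e.g. (2976, 3712, 3712)
    print(d.chunks)                           # chunk shape stored in the file
    print(d.compression, d.compression_opts)  # e.g. gzip 9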
Now the problem: slicing the time dimension on a NAS with a RAID 6 setup takes about 20 seconds, even though the data is in a single chunk...
I figure that even though it is in a single chunk in the file, because I wrote all of the values in a loop, it must be fragmented somehow (I don't know how this process works, though). This is why I tried an h5repack, using the HDF5 command-line tools, into a new file with the same chunks, hoping it would reorder the values so that the query could read them in a more sequential order, but no luck. Even though this process took 6 hours to run, it didn't change the query speed at all.
If I do my calculations right, one chunk (2976x32x32) is only a few MB (about 11 MB uncompressed, and I think only a bit more than 1 MB compressed). How can reading it take so long? What am I doing wrong? I would be glad if someone could shed some light on what is actually going on behind the scenes...
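For reference, the kind of query I am timing looks like this (a sketch; the pixel index is arbitrary, and the SWMR read flags match the writer above):

import time
import h5py

with h5py.File('/data/test.h5', 'r', libver='latest', swmr=True) as f:
    d = f['variable']
    t0 = time.perf_counter()
    series = d[:, 1856, 1856]   # full slice over time at one arbitrary pixel
    print(series.shape, '{:.2f} s'.format(time.perf_counter() - t0))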