I have a very large netCDF file that I am reading using netCDF4 in python
I cannot read this file all at once since its dimensions (1200 x 720 x 1440) are too big for the entire file to be in memory at once. The 1st dimension represents time, and the next 2 represent latitude and longitude respectively.
import netCDF4
nc_file = netCDF4.Dataset(path_file, 'r', format='NETCDF4')
for yr in years:
nc_file.variables[variable_name][int(yr), :, :]
However, reading one year at a time is excruciatingly slow. How do I speed this up for the use cases below?
--EDIT
The chunksize is 1
I can read a range of years: nc_file.variables[variable_name][0:100, :, :]
There are several use-cases:
for yr in years:
numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :])
# Multiply each year by a 2D array of shape (720 x 1440)
for yr in years:
numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :] * arr_2d)
# Add 2 netcdf files together
for yr in years:
numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :] +
nc_file2.variables[variable_name][int(yr), :, :])
I highly recommend that you take a look at the xarray
and dask
projects. Using these powerful tools will allow you to easily split up the computation in chunks. This brings up two advantages: you can compute on data which does not fit in memory, and you can use all of the cores in your machine for better performance. You can optimize the performance by appropriately choosing the chunk size (see documentation).
You can load your data from netCDF by doing something as simple as
import xarray as xr
ds = xr.open_dataset(path_file)
If you want to chunk your data in years along the time dimension, then you specify the chunks
parameter (assuming that the year coordinate is named 'year'):
ds = xr.open_dataset(path_file, chunks={'year': 10})
Since the other coordinates do not appear in the chunks
dict, then a single chunk will be used for them. (See more details in the documentation here.). This will be useful for your first requirement, where you want to multiply each year by a 2D array. You would simply do:
ds['new_var'] = ds['var_name'] * arr_2d
Now, xarray
and dask
are computing your result lazily. In order to trigger the actual computation, you can simply ask xarray
to save your result back to netCDF:
ds.to_netcdf(new_file)
The computation gets triggered through dask
, which takes care of splitting the processing out in chunks and thus enables working with data that does not fit in memory. In addition, dask
will take care of using all your processor cores for computing chunks.
The xarray
and dask
projects still don't handle nicely situations where chunks do not "align" well for parallel computation. Since in this case we chunked only in the 'year' dimension, we expect to have no issues.
If you want to add two different netCDF files together, it is as simple as:
ds1 = xr.open_dataset(path_file1, chunks={'year': 10})
ds2 = xr.open_dataset(path_file2, chunks={'year': 10})
(ds1 + ds2).to_netcdf(new_file)
I have provided a fully working example using a dataset available online.
In [1]:
import xarray as xr
import numpy as np
# Load sample data and strip out most of it:
ds = xr.open_dataset('ECMWF_ERA-40_subset.nc', chunks = {'time': 4})
ds.attrs = {}
ds = ds[['latitude', 'longitude', 'time', 'tcw']]
ds
Out[1]:
<xarray.Dataset>
Dimensions: (latitude: 73, longitude: 144, time: 62)
Coordinates:
* latitude (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
* longitude (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
* time (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
tcw (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
In [2]:
arr2d = np.ones((73, 144)) * 3.
arr2d.shape
Out[2]:
(73, 144)
In [3]:
myds = ds
myds['new_var'] = ds['tcw'] * arr2d
In [4]:
myds
Out[4]:
<xarray.Dataset>
Dimensions: (latitude: 73, longitude: 144, time: 62)
Coordinates:
* latitude (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
* longitude (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
* time (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
tcw (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
new_var (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...
In [5]:
myds.to_netcdf('myds.nc')
xr.open_dataset('myds.nc')
Out[5]:
<xarray.Dataset>
Dimensions: (latitude: 73, longitude: 144, time: 62)
Coordinates:
* latitude (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
* longitude (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
* time (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
tcw (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
new_var (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...
In [6]:
(myds + myds).to_netcdf('myds2.nc')
xr.open_dataset('myds2.nc')
Out[6]:
<xarray.Dataset>
Dimensions: (latitude: 73, longitude: 144, time: 62)
Coordinates:
* time (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
* latitude (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
* longitude (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
Data variables:
tcw (time, latitude, longitude) float64 20.31 20.31 20.31 20.31 ...
new_var (time, latitude, longitude) float64 60.92 60.92 60.92 60.92 ...
Check chunking of file. ncdump -s <infile>
will give the answer. If chunk size in time dimension is larger than one, You should read the same amount of years at once, otherwise You are reading several years at once from disk and using only one at a time.
How slow is slow? Max few seconds per timestep sounds reasonable for an array of this size.
Giving more info on what You do with the data later may give us more guidance on where the problem may be.
This is Kinda Hacky, but may be the simplest solution:
Read subsets of the file into memory, then cPickle (https://docs.python.org/3/library/pickle.html) the file back to disk for future use. Loading your data from a pickled data structure is likely to be faster than parsing netCDF every time.