Speeding up reading of very large netcdf file in p

2020-02-19 10:47发布

问题:

I have a very large netCDF file that I am reading using netCDF4 in python

I cannot read this file all at once since its dimensions (1200 x 720 x 1440) are too big for the entire file to be in memory at once. The 1st dimension represents time, and the next 2 represent latitude and longitude respectively.

import netCDF4 
nc_file = netCDF4.Dataset(path_file, 'r', format='NETCDF4')
for yr in years:
    nc_file.variables[variable_name][int(yr), :, :]

However, reading one year at a time is excruciatingly slow. How do I speed this up for the use cases below?

--EDIT

The chunksize is 1

  1. I can read a range of years: nc_file.variables[variable_name][0:100, :, :]

  2. There are several use-cases:

    for yr in years:

    numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :])
    

# Multiply each year by a 2D array of shape (720 x 1440)
for yr in years:
    numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :] * arr_2d)

# Add 2 netcdf files together 
for yr in years:
    numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :] + 
                 nc_file2.variables[variable_name][int(yr), :, :])

回答1:

I highly recommend that you take a look at the xarray and dask projects. Using these powerful tools will allow you to easily split up the computation in chunks. This brings up two advantages: you can compute on data which does not fit in memory, and you can use all of the cores in your machine for better performance. You can optimize the performance by appropriately choosing the chunk size (see documentation).

You can load your data from netCDF by doing something as simple as

import xarray as xr
ds = xr.open_dataset(path_file)

If you want to chunk your data in years along the time dimension, then you specify the chunks parameter (assuming that the year coordinate is named 'year'):

ds = xr.open_dataset(path_file, chunks={'year': 10})

Since the other coordinates do not appear in the chunks dict, then a single chunk will be used for them. (See more details in the documentation here.). This will be useful for your first requirement, where you want to multiply each year by a 2D array. You would simply do:

ds['new_var'] = ds['var_name'] * arr_2d

Now, xarray and dask are computing your result lazily. In order to trigger the actual computation, you can simply ask xarray to save your result back to netCDF:

ds.to_netcdf(new_file)

The computation gets triggered through dask, which takes care of splitting the processing out in chunks and thus enables working with data that does not fit in memory. In addition, dask will take care of using all your processor cores for computing chunks.

The xarray and dask projects still don't handle nicely situations where chunks do not "align" well for parallel computation. Since in this case we chunked only in the 'year' dimension, we expect to have no issues.

If you want to add two different netCDF files together, it is as simple as:

ds1 = xr.open_dataset(path_file1, chunks={'year': 10})
ds2 = xr.open_dataset(path_file2, chunks={'year': 10})
(ds1 + ds2).to_netcdf(new_file)

I have provided a fully working example using a dataset available online.

In [1]:

import xarray as xr
import numpy as np

# Load sample data and strip out most of it:
ds = xr.open_dataset('ECMWF_ERA-40_subset.nc', chunks = {'time': 4})
ds.attrs = {}
ds = ds[['latitude', 'longitude', 'time', 'tcw']]
ds

Out[1]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...

In [2]:

arr2d = np.ones((73, 144)) * 3.
arr2d.shape

Out[2]:

(73, 144)

In [3]:

myds = ds
myds['new_var'] = ds['tcw'] * arr2d

In [4]:

myds

Out[4]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
    new_var    (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...

In [5]:

myds.to_netcdf('myds.nc')
xr.open_dataset('myds.nc')

Out[5]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
    new_var    (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...

In [6]:

(myds + myds).to_netcdf('myds2.nc')
xr.open_dataset('myds2.nc')

Out[6]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
Data variables:
    tcw        (time, latitude, longitude) float64 20.31 20.31 20.31 20.31 ...
    new_var    (time, latitude, longitude) float64 60.92 60.92 60.92 60.92 ...


回答2:

Check chunking of file. ncdump -s <infile> will give the answer. If chunk size in time dimension is larger than one, You should read the same amount of years at once, otherwise You are reading several years at once from disk and using only one at a time. How slow is slow? Max few seconds per timestep sounds reasonable for an array of this size. Giving more info on what You do with the data later may give us more guidance on where the problem may be.



回答3:

This is Kinda Hacky, but may be the simplest solution:

Read subsets of the file into memory, then cPickle (https://docs.python.org/3/library/pickle.html) the file back to disk for future use. Loading your data from a pickled data structure is likely to be faster than parsing netCDF every time.