I'm downloading daily 600MB netcdf-4 files that have this structure:
netcdf myfile {
dimensions:
    time_counter = 18 ;
    depth = 50 ;
    latitude = 361 ;
    longitude = 601 ;
variables:
    salinity, temp, etc.
I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.
I found a way of doing it with the NetCDF command-line tools and sed, like this:
ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" | ncgen -o myfileunlimited.nc
which worked for me on small files, but dumping a 600 MB NetCDF file takes too much memory and time.
Does somebody know another method for accomplishing this?
The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.
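For what it's worth, a sketch of that restriction, assuming the dimensions block sits within the first 20 lines of the CDL header:
sed -e "1,20 s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#"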
The core problem is likely that you're spending a lot of time in ncdump, formatting the file information into textual data, and in ncgen, parsing textual data into a NetCDF file format again.
As the route through dump+gen is about as slow as it is shown, that leaves using NetCDF functionality to do the conversion of your data files.
If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
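If you end up writing it yourself, a rough sketch with the netCDF4-python library could look like this (the file and dimension names come from the question; the rest is an assumption, not a tested implementation):

import netCDF4

with netCDF4.Dataset("myfile.nc") as src, \
     netCDF4.Dataset("myfileunlimited.nc", "w", format="NETCDF4") as dst:
    # copy global attributes
    dst.setncatts({k: src.getncattr(k) for k in src.ncattrs()})
    # copy dimensions, declaring time_counter as unlimited (size None)
    for name, dim in src.dimensions.items():
        dst.createDimension(name, None if name == "time_counter" else len(dim))
    # copy each variable's definition, attributes and data
    for name, var in src.variables.items():
        out = dst.createVariable(name, var.datatype, var.dimensions)
        out.setncatts({k: var.getncattr(k) for k in var.ncattrs()})
        # reads the whole variable into memory; copy slice-by-slice along
        # time_counter if that is too much for the 600 MB files
        out[:] = var[:]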
If you're extremely unlucky, you'll have to work one level down: NetCDF-4 files are HDF5 files with some extra metadata. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).
It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it: "Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."
As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.
Any decent disk should read/sink your files in 10 seconds or so; memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.
If concurrent memory use is a big concern, you can trade disk space for memory by saving the intermediary result from sed onto disk; that file will likely take up to 1.5 gigabytes or so.
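A sketch of that two-step variant (the intermediate myfile.cdl is the part that takes the extra disk space):

ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" > myfile.cdl
ncgen -o myfileunlimited.nc myfile.cdl
rm myfile.cdl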
You can use the xarray python package's to_netcdf() method, then optimise memory usage by using Dask.
You just need to pass the names of the dimensions to make unlimited to the unlimited_dims argument and use chunks to split the data. For instance:
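(A minimal sketch; the file name, dimension name and chunk size are taken from the question or chosen just for illustration.)

import xarray as xr

# open lazily with Dask-backed chunks so the whole 600 MB file is never
# loaded into memory at once; the chunk size is illustrative, not tuned
ds = xr.open_dataset("myfile.nc", chunks={"time_counter": 1})

# write a copy in which time_counter is an unlimited (record) dimension
ds.to_netcdf("myfileunlimited.nc", unlimited_dims=["time_counter"])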
There is a nice summary of combining Dask and xarray linked here.

Your answers are very insightful. I'm not really looking for a way to improve this ncdump-sed-ncgen method; I know that dumping a 600 MB netcdf file uses almost 5 times more space in a text file (CDL representation). Modifying some header text and then generating the netcdf file again doesn't feel very efficient.
I read the latest NCO documentation and found an option specific to ncks: "--mk_rec_dmn". ncks mainly extracts and writes or appends data to a new netcdf file, so this seems the better approach: extract all the data of myfile.nc and write it with a new record dimension (unlimited dimension), which "--mk_rec_dmn" does, then replace the old file.
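Something along these lines (dimension and file names taken from my example above):

ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc
mv myfileunlimited.nc myfile.nc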
To do the opposite operation (record dimension back to fixed-size), it would be:
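I'd guess the complementary option is "--fix_rec_dmn" (an assumption on my part, so verify it against your NCO documentation):

ncks --fix_rec_dmn time_counter myfileunlimited.nc -o myfilefixedsize.nc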