I'm reading in a NetCDF file, and I want to read each array as a float array and then write that float array to a new file. I can make it work if I read in the float array and then iterate over each element (using a DataOutputStream), but this is very, very slow; my NetCDF files are over 1 GB.
I tried using an ObjectOutputStream, but this writes extra bytes of information.
So, to recap.
1. Open NetCDF file
2. Read float array x from NetCDF file
3. Write float array x to raw data file in a single step
4. Repeat step 2 with x+1
OK, you have 1 GB to read and 1 GB to write. Depending on your hard drive, you might get about 100 MB/s read and 60 MB/s write speed. This means it will take about 27 seconds to read and write (roughly 10 seconds of reading plus 17 seconds of writing).
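The estimate above is simple arithmetic, sketched here so you can plug in your own numbers. The class and method names are made up for illustration, and the sketch assumes the read and the write happen one after the other rather than overlapping.

```java
class CopyTimeEstimate {
    /** Seconds to read sizeMb and then write sizeMb, with no overlap between the two. */
    static double seconds(double sizeMb, double readMbPerSec, double writeMbPerSec) {
        return sizeMb / readMbPerSec + sizeMb / writeMbPerSec;
    }

    public static void main(String[] args) {
        // 1 GB taken as 1000 MB, read at 100 MB/s and written at 60 MB/s
        System.out.println(seconds(1000, 100, 60) + " seconds");
    }
}
```

With the figures from this answer the result is a little under 27 seconds.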
What is the speed of your drive and how much slower than this are you seeing?
If you want to test the speed of your disk without any processing, time how long it takes to copy a file which you haven't accessed recently (i.e. one that is not in the disk cache). This will give you an idea of the minimum delay you can expect when reading and then writing most of the data from the file (i.e. with no processing or Java involved).
For the benefit of anyone who would like to know how to do a loop-less copy of data, i.e. one that doesn't just call a method which loops for you:
FloatBuffer src = // readable memory mapped file.
FloatBuffer dest = // writeable memory mapped file.
src.position(start);
src.limit(end);
dest.put(src);
If you have mixed types of data, you can use a ByteBuffer, which notionally copies a byte at a time but in reality could use long or a wider type to copy 8 or more bytes at a time, i.e. whatever the CPU can do.
For small blocks this will use a loop, but for large blocks it can use page-mapping tricks in the OS. In any case, how it does it is not defined in Java, but it's likely to be the fastest way to copy data.
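A fuller version of the snippet above might look like the following. This is only a sketch: the class name, method name and parameters are hypothetical, it assumes the source file holds raw big-endian floats, and it assumes each mapped region is under 2 GB (the limit for a single mapping), so a larger file would need to be mapped in pieces.

```java
import java.io.IOException;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedFloatCopy {
    /** Copies floats [start, end) of the source file to the beginning of the destination file. */
    static void copyFloats(Path srcPath, Path destPath, int start, int end) throws IOException {
        try (FileChannel in = FileChannel.open(srcPath, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(destPath, StandardOpenOption.CREATE,
                     StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Readable memory-mapped view of the whole source file (big-endian by default).
            FloatBuffer src = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size()).asFloatBuffer();
            // Writeable mapping sized for the copied range; mapping grows the file if needed.
            MappedByteBuffer destBytes =
                    out.map(FileChannel.MapMode.READ_WRITE, 0, (long) (end - start) * Float.BYTES);
            FloatBuffer dest = destBytes.asFloatBuffer();

            src.position(start);
            src.limit(end);
            dest.put(src);      // bulk copy, no explicit loop in user code
            destBytes.force();  // ask the OS to flush the mapped pages to disk
        }
    }
}
```

Note that start and end are counted in floats, not bytes, because they are applied to the FloatBuffer view.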
Most of these tricks only make a difference if you are copying a file that is already in memory to a cached file. As soon as you read a file from disk, or the file is too large to cache, the I/O bandwidth of your physical disk is the only thing that really matters.
This is because a CPU can copy data at 6 GB/s to main memory but only 60-100 MB/s to a hard drive. If the copy in the CPU/memory is 2x, 10x or 50x slower than it could be, it will still be waiting for the disk. Note: with no buffering at all, a slowdown of that size or worse is entirely possible, but provided you have any simple buffering the CPU will be faster than the disk.
1) When writing, use a BufferedOutputStream; you will get a factor of 100 speedup.
2) When reading, read at least 10K per read; 100K is probably better.
3) Post your code.
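Points 1 and 2 could look something like this. It is a minimal sketch rather than a tuned implementation: the class and method names are invented for illustration, and the 64K and 100K buffer sizes are just reasonable guesses in the range suggested above.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class BufferedFloatIO {
    /** Writes the floats through a 64K buffer, so most writeFloat calls never touch the disk. */
    static void writeFloats(File file, float[] data) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file), 64 * 1024))) {
            for (float f : data) {
                out.writeFloat(f);
            }
        }
    }

    /** Reads count floats through a 100K buffer, so the file is read in large blocks. */
    static float[] readFloats(File file, int count) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file), 100 * 1024))) {
            float[] data = new float[count];
            for (int i = 0; i < count; i++) {
                data[i] = in.readFloat();
            }
            return data;
        }
    }
}
```

The per-element loop is still there, but each call now only copies four bytes into or out of an in-memory buffer.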
I ran into the same problem and will dump my solution here just for future reference.
It is very slow to iterate over an array of floats and call DataOutputStream.writeFloat for each of them. Instead, transform the floats into a byte array yourself and write that array all at once:
Slow:
DataOutputStream out = ...;
for (int i = 0; i < floatarray.length; ++i)
    out.writeFloat(floatarray[i]);
Much faster:
DataOutputStream out = ...;
byte[] buf = new byte[4 * floatarray.length];
for (int i = 0; i < floatarray.length; ++i)
{
    int val = Float.floatToRawIntBits(floatarray[i]);
    // most significant byte first, the same byte order writeFloat uses
    buf[4 * i] = (byte) (val >> 24);
    buf[4 * i + 1] = (byte) (val >> 16);
    buf[4 * i + 2] = (byte) (val >> 8);
    buf[4 * i + 3] = (byte) (val);
}
out.write(buf);
If your array is very large (>100k), break it up into chunks so that the buffer array does not use up the heap.
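One way to do the chunking is sketched below. It swaps the manual bit shifting for a ByteBuffer with a FloatBuffer view, which produces the same big-endian bytes; the class name, method name and the chunkFloats parameter are made up for this example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;

class ChunkedFloatWriter {
    /** Writes data to out as big-endian floats, converting at most chunkFloats values at a time. */
    static void write(OutputStream out, float[] data, int chunkFloats) throws IOException {
        // One reusable heap buffer; ByteBuffer is big-endian by default, like DataOutputStream.
        ByteBuffer bytes = ByteBuffer.allocate(chunkFloats * Float.BYTES);
        FloatBuffer floats = bytes.asFloatBuffer(); // view sharing the same backing array

        for (int off = 0; off < data.length; off += chunkFloats) {
            int n = Math.min(chunkFloats, data.length - off);
            floats.clear();
            floats.put(data, off, n);                      // bulk float-to-byte conversion
            out.write(bytes.array(), 0, n * Float.BYTES);  // one write call per chunk
        }
    }
}
```

The buffer is allocated once and reused, so memory use stays at chunkFloats * 4 bytes no matter how large the array is.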
If you are using the Unidata NetCDF library, your problem may not be the writing, but rather the NetCDF library's caching mechanism. Try invalidating the variable's cache after each read:
NetcdfFile file = NetcdfFile.open(filename);
Variable variable = file.findVariable(variableName);
for (...) {
    // read data from the variable here
    variable.invalidateCache();  // drop the cached data before the next read
}
Lateral solution:
If this is a one-off generation (or if you are willing to automate it in an Ant script) and you have access to some kind of Unix environment, you can use ncdump instead of doing it in Java. Something like:
ncdump -v your_variable your_file.nc | [awk] > float_array.txt
You can control the precision of the floats with the -p option if you desire. I just ran it on a 3GB NetCDF file and it worked fine. As much as I love Java, this is probably the quickest way to do what you want.