What is the fastest way to read a specific chunk of data from a large binary file in Python?

Posted 2019-08-18 18:09

Question:

I have a sensor unit which generates data in large binary files. File sizes can run into several tens of Gigabytes. I need to:

  1. Read the data.
  2. Process it to extract the information I need.
  3. Display / Visualize the data.

Data in the binary file is formatted as single-precision floats, i.e. numpy.float32.

I have written code which works well. I am now looking to optimize it for time, and I observe that reading the binary data takes a very long time. The following is what I have right now:

import numpy as np

def get_data(n):
    '''
    Function to get relevant trace data from the data file.
    Usage:
        get_data(n)
        where n is an integer giving the trace number to be read
    Return:
        data_array : Python list containing single wavelength data.
    '''
    with open(data_file, 'rb') as fid:
        data_array = list(np.fromfile(fid, np.float32)[n*no_of_points_per_trace:no_of_points_per_trace*(n+1)])
    return data_array

This allows me to iterate over the value of n and obtain different traces, i.e. chunks of data. As its name suggests, the variable no_of_points_per_trace contains the number of points in every trace; I obtain it from a separate .info file.
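For illustration only, a minimal sketch of that lookup, assuming a hypothetical .info layout in which the value appears on a line such as points_per_trace = 1000 (the real format may differ):

import re

def read_points_per_trace(info_file):
    # Hypothetical layout: a line like "points_per_trace = 1000".
    with open(info_file) as fh:
        for line in fh:
            m = re.match(r'\s*points_per_trace\s*=\s*(\d+)', line)
            if m:
                return int(m.group(1))
    raise ValueError('points_per_trace not found in ' + info_file)

no_of_points_per_trace = read_points_per_trace('sensor.info')  # hypothetical filename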

Is there an optimal way to do this?

Answer 1:

Right now you are reading the whole file into memory every time you call np.fromfile(fid, np.float32). If the file fits in memory and you want to access a significant number of traces (i.e. you are calling your function with lots of different values of n), your only big speedup is to avoid reading it multiple times. So you might want to read the whole file once and have your function just index into that:

# just once:
with open(data_file, 'rb') as fid:
    alldata = np.fromfile(fid, np.float32)

# then use this function
def get_data(alldata, n):
    return alldata[n*no_of_points_per_trace:no_of_points_per_trace*(n+1)]
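A quick usage sketch of that change: alldata is loaded once, and every call afterwards is just a slice. Numpy slices are views, so no data is copied:

for n in (0, 3, 17):               # any traces, in any order
    trace = get_data(alldata, n)   # an array view: no disk access, no copy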

Now, if you find yourself needing only one or two traces from the big file, you can seek into it and just read the part you want:

def get_data(n):
    dtype = np.float32
    with open(data_file, 'rb') as fid:
        # skip over the first n traces (itemsize is 4 bytes for float32)
        fid.seek(dtype().itemsize*no_of_points_per_trace*n)
        # read exactly one trace worth of samples
        data_array = np.fromfile(fid, dtype, count=no_of_points_per_trace)
    return data_array

You will notice I have skipped converting to list. This is a slow step and probably not required for your workflow.
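If you need fast random access to many traces but the file is too big to hold in memory, memory mapping is another option along the same lines (a sketch, assuming the same data_file and no_of_points_per_trace as above): numpy.memmap exposes the file as a flat array, and the operating system pages in only the parts you actually index.

import numpy as np

# Map the whole file as a flat float32 array; nothing is read from disk yet.
alldata = np.memmap(data_file, dtype=np.float32, mode='r')

def get_data(n):
    # Slicing pages in only the bytes backing this trace;
    # np.array(...) copies the slice into ordinary memory.
    return np.array(alldata[n*no_of_points_per_trace:no_of_points_per_trace*(n+1)])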