What is the fastest way to read a specific chunk of data from a large binary file in Python?

I have a sensor unit that generates data in large binary files. File sizes can run into several tens of gigabytes. I need to:

Read the data.
Process it to extract the necessary information I want.
Display/visualize the data.

The data in the binary file is stored as single-precision floats, i.e. numpy.float32.

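I have written code that works well, and I am now optimizing it for speed. I have observed that it takes a very long time to read the binary data. Here is what I have right now: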

def get_data(n):
    '''
    Function to get relevant trace data from the data file.
    Usage :
        get_data(n)
        where n is integer containing relevant trace number to be read
    Return :
        data_array : Python array containing single wavelength data.
    '''
    with open(data_file, 'rb') as fid:
        data_array = list(np.fromfile(fid, np.float32)[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))])
    return data_array

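This lets me iterate over values of n and obtain different traces, i.e. chunks of data. The variable no_of_points_per_trace contains the number of points in every trace, as the name suggests. I obtain it from a separate .info file.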
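
As a quick sanity check on this layout, the number of traces in a file follows from its size: each float32 value occupies 4 bytes, so one trace takes 4 * no_of_points_per_trace bytes. A minimal sketch, where count_traces is a hypothetical helper and the path is passed in explicitly rather than read from a global:

```python
import os
import numpy as np

def count_traces(path, no_of_points_per_trace, dtype=np.float32):
    """Return how many complete traces the file at `path` holds (hypothetical helper)."""
    # Bytes occupied by one trace: item size (4 for float32) times points per trace.
    bytes_per_trace = np.dtype(dtype).itemsize * no_of_points_per_trace
    # Integer division ignores any trailing partial trace.
    return os.path.getsize(path) // bytes_per_trace
```

Valid values of n for get_data would then run from 0 to count_traces(...) - 1.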

Is there an optimal way to do this?

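Right now you are reading the whole file into memory every time you call np.fromfile(fid, np.float32). If the file fits in memory and you want to access a significant number of traces (that is, you call your function with many different values of n), the only big speedup available is to avoid reading the file more than once. So you might want to read the whole file once and have your function simply index into it: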

# just once:
with open(data_file, 'rb') as fid:
    alldata = list(np.fromfile(fid, np.float32))

# then use this function
def get_data(alldata, n):
    return alldata[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))]
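
If the whole file does fit in memory, one variation on this idea is to keep the data as a NumPy array and reshape it so that each row is one trace; looking up a trace then becomes a single row index. This is only a sketch: load_all_traces is a hypothetical helper, and it assumes the file contains a whole number of traces.

```python
import numpy as np

def load_all_traces(path, no_of_points_per_trace, dtype=np.float32):
    """Read the file once and return a 2-D array with one trace per row.

    Hypothetical helper; assumes the file holds a whole number of traces
    (reshape raises ValueError otherwise).
    """
    flat = np.fromfile(path, dtype=dtype)
    # -1 lets NumPy infer the number of traces from the total length.
    return flat.reshape(-1, no_of_points_per_trace)
```

With traces = load_all_traces(data_file, no_of_points_per_trace), trace n is simply traces[n], which is a view into the loaded array rather than a copy.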

Now, if you find yourself needing only one or two traces from a big file, you can seek into it and read just the part you want:

def get_data(n):
    dtype = np.float32
    with open(data_file, 'rb') as fid:
        fid.seek(dtype().itemsize*no_of_points_per_trace*n)
        data_array = np.fromfile(fid, dtype, count=no_of_points_per_trace)
    return data_array
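
To check that the seek offset lands on the right trace, the same approach can be tried on a small synthetic file. The version below is a sketch that takes the path and trace length as arguments instead of globals; read_trace and the demo file are made up for illustration.

```python
import os
import tempfile
import numpy as np

def read_trace(path, n, no_of_points_per_trace, dtype=np.float32):
    """Read only trace `n` by seeking to its byte offset (illustrative variant)."""
    itemsize = np.dtype(dtype).itemsize  # 4 bytes for float32
    with open(path, 'rb') as fid:
        # Skip the n traces that come before the one we want.
        fid.seek(itemsize * no_of_points_per_trace * n)
        return np.fromfile(fid, dtype, count=no_of_points_per_trace)

# Build a tiny demo file: 3 traces of 4 points each, holding the values 0..11.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
np.arange(12, dtype=np.float32).tofile(demo_path)

trace = read_trace(demo_path, 1, 4)  # second trace: values 4, 5, 6, 7
```

Only no_of_points_per_trace values are read from disk per call, so the cost no longer depends on the total file size.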

You'll notice I have skipped the conversion to a list. It is a slow step and may not be needed for your workflow.
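
As an illustration of why the list may be unnecessary, most processing can be done on the array directly. The snippet below is a sketch with synthetic values standing in for one trace returned by get_data:

```python
import numpy as np

# Stand-in for one trace returned by get_data (synthetic values for illustration).
trace = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

peak = trace.max()        # vectorized reduction, no Python-level loop
scaled = trace * 2.0      # element-wise arithmetic on the whole trace
as_list = trace.tolist()  # convert only where a plain list is truly required
```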