Read binary flatfile and skip bytes

2020-04-18 08:41发布

I have a binary file that has data organized into 400 byte groups. I want to build an array of type np.uint32 from bytes at position 304 to position 308. However, I cannot find a method provided by NumPy that lets me select which bytes to read, only an initial offset as defined in numpy.fromfile.

For example, if my file contains 1000 groups of 400 bytes, I need an array of size 1000 such that:

arr[0] = bytes 304-308
arr[1] = bytes 704-708
...
arr[-1] = bytes 399904 - 399908

Is there a NumPy method that would allow me to specify which bytes to read from a buffer?

标签: python numpy
1条回答
够拽才男人
2楼-- · 2020-04-18 08:50

Another way to rephrase what you are looking for (slightly), is to say you want to read uint32 numbers starting at offset 304, with a stride of 400 bytes. np.fromfile does not provide an argument to insert custom strides (although it probably should). You have a couple of different options going forward.

The simplest is probably to load the entire file and subset the column you want:

data = np.fromfile(filename, dtype=np.uint32)[304 // 4::400 // 4].copy()

If you want more control over the exact positioning of the bytes (e.g., if the offset or block size is not a multiple of 4), you can use structured arrays instead:

dt = np.dtype([('_1', 'u1', 304), ('data', 'u4'), ('_2', 'u1', 92)])
data = np.fromfile(filename, dtype=dt)['data'].copy()

Here, _1 and _2 are used to discard the unneeded bytes with 1-byte resolution rather than 4.

Loading the entire file is generally going to be much faster than seeking between reads, so these approaches are likely desirable for files that fit into memory. If that is not the case, you can use memory mapping, or an entirely home-grown solution.

Memory maps can be implemented via Pythons mmap module, and wrapped in an ndarray using the buffer parameter, or you can use the np.memmap class that does it for you:

mm = np.memmap(filename, dtype=np.uint32, mode='r', offset=0, shape=(1000, 400 // 4))
data = np.array(mm[:, 304 // 4])
del mm

Using a raw mmap is arguably more efficient because you can specify a strides and offset that look directly into the map, skipping all the extra data. It is also better, because you can use an offset and strides that are not multiples of the size of a np.uint32:

with open(filename, 'rb') as f, mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
    data = np.ndarray(buffer=mm, dtype=np.uint32, offset=304, strides=400, shape=1000).copy()

The final call to copy is required because the underlying buffer will be invalidated as soon as the memory map is closed, possibly leading to a segfault.

查看更多
登录 后发表回答