I have a binary file with data organized into 400-byte groups. I want to build an array of type `np.uint32` from the bytes at positions 304 to 308 of each group. However, I cannot find a NumPy method that lets me select which bytes to read; `numpy.fromfile` only accepts an initial offset.
For example, if my file contains 1000 groups of 400 bytes, I need an array of size 1000 such that:
arr[0] = bytes 304-308
arr[1] = bytes 704-708
...
arr[-1] = bytes 399904-399908
Is there a NumPy method that would allow me to specify which bytes to read from a buffer?
Another way to phrase what you are looking for (slightly) is to say you want to read `uint32` numbers starting at offset 304, with a stride of 400 bytes. `np.fromfile` does not provide an argument for custom strides (although it probably should). You have a couple of different options going forward. The simplest is probably to load the entire file and subset the column you want:
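A minimal sketch of that approach, assuming a hypothetical file named `data.bin` (the demo-file setup at the top just creates test data; only the last two lines are the actual technique):

```python
import numpy as np

# Build a small demo file: 10 records of 400 bytes; bytes 304-307 of
# record i hold the uint32 value i. (File name is made up here.)
records, recsize, offset = 10, 400, 304
raw = bytearray(records * recsize)
for i in range(records):
    raw[i * recsize + offset : i * recsize + offset + 4] = np.uint32(i).tobytes()
with open('data.bin', 'wb') as f:
    f.write(raw)

# Load everything as uint32, reshape to one row per 400-byte record,
# then take the column that starts at byte offset 304 (304 // 4 == 76).
data = np.fromfile('data.bin', dtype=np.uint32).reshape(-1, recsize // 4)
arr = data[:, offset // 4]
print(arr)
```

This reads the whole file into memory, so it is only appropriate when the file fits comfortably in RAM.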
If you want more control over the exact positioning of the bytes (e.g., if the offset or block size is not a multiple of 4), you can use structured arrays instead:
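For instance, a structured dtype with 1-byte padding fields on either side of the value might look like this (again with a hypothetical `data.bin`, created here only for demonstration):

```python
import numpy as np

# Demo file as before: 10 records of 400 bytes, value i at byte 304.
records = 10
raw = bytearray(records * 400)
for i in range(records):
    raw[i * 400 + 304 : i * 400 + 308] = np.uint32(i).tobytes()
with open('data.bin', 'wb') as f:
    f.write(raw)

# 304 pad bytes + 4-byte uint32 + 92 pad bytes = 400 bytes per record.
record = np.dtype([('_1', np.uint8, (304,)),
                   ('x', np.uint32),
                   ('_2', np.uint8, (92,))])
arr = np.fromfile('data.bin', dtype=record)['x']
print(arr)
```

Because the padding fields are arrays of `np.uint8`, the offset and trailing gap can be any byte count, not just multiples of 4.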
Here, `_1` and `_2` are used to discard the unneeded bytes with 1-byte resolution rather than 4. Loading the entire file is generally much faster than seeking between reads, so these approaches are desirable for files that fit into memory. If that is not the case, you can use memory mapping, or an entirely home-grown solution.
Memory maps can be implemented via Python's `mmap` module and wrapped in an `ndarray` using the `buffer` parameter, or you can use the `np.memmap` class that does it for you.

Using a raw `mmap` is arguably more efficient because you can specify strides and an offset that look directly into the map, skipping all the extra data. It is also better because you can use an offset and strides that are not multiples of the size of a `np.uint32`. The final call to `copy` is then required because the underlying buffer will be invalidated as soon as the memory map is closed, possibly leading to a segfault.
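Both memory-mapped variants, sketched with the same hypothetical `data.bin` (the setup block only generates test data):

```python
import mmap
import numpy as np

# Demo file: 10 records of 400 bytes, value i at byte offset 304.
records, recsize, offset = 10, 400, 304
raw = bytearray(records * recsize)
for i in range(records):
    raw[i * recsize + offset : i * recsize + offset + 4] = np.uint32(i).tobytes()
with open('data.bin', 'wb') as f:
    f.write(raw)

# Option 1: np.memmap -- map the file, view it as rows of 100 uint32,
# and copy out the column starting at byte 304.
mm = np.memmap('data.bin', dtype=np.uint32, mode='r').reshape(-1, recsize // 4)
arr1 = np.array(mm[:, offset // 4])

# Option 2: raw mmap -- an ndarray looking straight into the map, with a
# byte offset and stride that need not be multiples of 4.
with open('data.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        arr2 = np.ndarray(shape=(records,), dtype=np.uint32,
                          buffer=m, offset=offset, strides=(recsize,)).copy()
        # .copy() detaches the result from the map's buffer; using the
        # uncopied view after the map closes could segfault.

print(arr1, arr2)
```

Either way, only the mapped pages actually touched are read from disk, so this scales to files larger than memory.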