In Python, how do I read in a binary file and loop over each byte of that file?
This generator yields bytes from a file, reading the file in chunks:
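A sketch of such a generator (the helper name `bytes_from_file`, the 8192-byte chunk size, and `do_stuff_with` are all illustrative):

```python
def bytes_from_file(filename, chunksize=8192):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if chunk:
                for b in chunk:
                    yield b
            else:
                break

# Usage: do_stuff_with is a placeholder for your own processing.
for b in bytes_from_file('somefile.bin'):
    do_stuff_with(b)
```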
See the Python documentation for information on iterators and generators.
Python 2.4 and Earlier
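A sketch using `try`/`finally`, since `with` is unavailable here; in Python 2, `f.read(1)` returns an empty string at EOF:

```python
f = open("myfile", "rb")
try:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
finally:
    f.close()
```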
Python 2.5-2.7
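The same loop, with `with` handling cleanup:

```python
with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
```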
Note that the with statement is not available in versions of Python below 2.5. To use it in v 2.5 you'll need to import it:
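```python
from __future__ import with_statement
```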
In 2.6 this is not needed.
Python 3
In Python 3, it's a bit different. We will no longer get raw characters from the stream in byte mode but byte objects, thus we need to alter the condition:
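For example:

```python
with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        # Do stuff with byte.
        byte = f.read(1)
```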
Or, as benhoyt says, skip the not-equal check and take advantage of the fact that `b""` evaluates to false. This makes the code compatible between 2.6 and 3.x without any changes. It would also save you from changing the condition if you go from byte mode to text or the reverse:
If you have a lot of binary data to read, you might want to consider the `struct` module. It is documented as converting "between C and Python types", but of course, bytes are bytes, and whether they were created as C types does not matter. For example, if your binary data contains two 2-byte integers and one 4-byte integer, you can read them with `struct.unpack`, as in the sketch below (adapted from the `struct` documentation; the `'>'` prefix makes the byte order and sizes explicit). You might find this more convenient, faster, or both, than explicitly looping over the content of a file.
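```python
import struct

# '>hhl': big-endian, two 2-byte integers followed by one 4-byte integer
print(struct.unpack('>hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03'))
# -> (1, 2, 3)
```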
New in Python 3.5 is the `pathlib` module, which has a convenience method specifically to read in a file as bytes, allowing us to iterate over the bytes. I consider this a decent (if quick and dirty) answer; interestingly, it is the only one here to mention `pathlib`:
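A sketch (`path` is assumed to hold the file's location):

```python
import pathlib

for byte in pathlib.Path(path).read_bytes():
    print(byte)
```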
In Python 2, you probably would do this (as Vinay Sajip also suggests):
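A sketch:

```python
with open(path, 'rb') as file:
    for byte in file.read():
        print(byte)
```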
In the case that the file may be too large to iterate over in memory, you would chunk it, idiomatically, using the `iter` function with the `(callable, sentinel)` signature. Several other answers mention this, but few offer a sensible read size. Here is the Python 2 version:
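A sketch (the 1024-byte read size is illustrative):

```python
with open(path, 'rb') as file:
    read_chunk = lambda: file.read(1024)  # illustrative read size
    for chunk in iter(read_chunk, b''):   # b'' is the sentinel: stop at EOF
        for byte in chunk:
            print(byte)
```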
Best practice for large files or buffered/interactive reading
Let's create a function to do this, including idiomatic uses of the standard library for Python 3.5+:
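A sketch of such a function (the name `file_byte_iterator` is illustrative):

```python
from functools import partial
from io import DEFAULT_BUFFER_SIZE
from pathlib import Path

def file_byte_iterator(path):
    """Lazily yield the bytes of the file at the given path."""
    path = Path(path)
    with path.open('rb') as file:
        # read1 returns as soon as *any* data is available,
        # up to DEFAULT_BUFFER_SIZE bytes per call.
        reader = partial(file.read1, DEFAULT_BUFFER_SIZE)
        for chunk in iter(reader, b''):  # b'' signals EOF
            yield from chunk
```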
Note that we use `file.read1`. `file.read` blocks until it gets all the bytes requested of it or until EOF; `file.read1` allows us to avoid blocking, and it can return more quickly because of this. No other answers mention this.

Demonstration of best-practice usage:
Let's make a file with a megabyte (actually mebibyte) of pseudorandom data:
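A sketch (the filename is illustrative):

```python
import random
from pathlib import Path

path = 'pseudorandom_bytes'
Path(path).write_bytes(
    bytes(random.randint(0, 255) for _ in range(2**20)))
```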
Now let's iterate over it and materialize it in memory:
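Assuming the `file_byte_iterator` sketch above (the length is 2**20 by construction):

```python
>>> l = list(file_byte_iterator(path))
>>> len(l)
1048576
```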
We can inspect any part of the data, for example, the last 100 and first 100 bytes:
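For example (outputs omitted here, since the data is pseudorandom):

```python
>>> l[-100:]  # the last 100 bytes
>>> l[:100]   # the first 100 bytes
```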
Don't iterate by lines for binary files
Don't do the following - this pulls a chunk of arbitrary size until it gets to a newline character - too slow when the chunks are too small, and possibly too large as well:
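That is, avoid something like this sketch:

```python
# DON'T: newline-based iteration over a binary file
with open(path, 'rb') as file:
    for chunk in file:  # chunks split wherever a 0x0a byte happens to occur
        for byte in chunk:
            ...  # Do stuff with byte
```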
The above is only good for files that are semantically human-readable text (plain text, code, markup, Markdown, etc.; essentially anything ASCII-, UTF-, or Latin-encoded).
To read a file one byte at a time (ignoring the buffering), you could use the two-argument `iter(callable, sentinel)` built-in function:
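A sketch:

```python
with open(filename, 'rb') as file:
    for byte in iter(lambda: file.read(1), b''):
        ...  # Do stuff with byte
```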
It calls `file.read(1)` until it returns nothing, `b''` (an empty bytestring). Memory doesn't grow without bound for large files. You could pass `buffering=0` to `open()` to disable the buffering; that guarantees only one byte is read per iteration (slow). The `with` statement closes the file automatically, including the case when the code underneath raises an exception.

Despite the presence of internal buffering by default, it is still inefficient to process one byte at a time. For example, here's a `blackhole.py` utility that eats everything it is given:
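One possible implementation (the zero-length `deque` is the standard trick for draining an iterator):

```python
#!/usr/bin/env python3
"""Discard all input. A `cat > /dev/null` analog."""
import sys
from collections import deque
from functools import partial

# Chunk size comes from the command line; default is 32768 bytes.
chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
# Read binary chunks from stdin until b'' (EOF) and discard them all.
deque(iter(partial(sys.stdin.buffer.read, chunksize), b''), maxlen=0)
```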
Example, piping a gigabyte of zeros into it (the exact command is illustrative):
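```
$ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py
```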
It processes ~1.5 GB/s when `chunksize == 32768` on my machine, and only ~7.5 MB/s when `chunksize == 1`. That is, it is 200 times slower to read one byte at a time. Take it into account if you can rewrite your processing to use more than one byte at a time and if you need performance.
`mmap` allows you to treat a file as a `bytearray` and a file object simultaneously. It can serve as an alternative to loading the whole file into memory if you need access to both interfaces. In particular, you can iterate one byte at a time over a memory-mapped file using a plain `for`-loop:
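A sketch (the variable name `mm` is illustrative):

```python
from mmap import ACCESS_READ, mmap

with open(filename, 'rb') as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
    for byte in mm:  # iterates over every byte of the mapped file
        ...  # Do stuff with byte
```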
`mmap` supports the slice notation. For example, `mm[i:i+len]` returns `len` bytes from the file starting at position `i`. The context manager protocol is not supported before Python 3.2; you need to call `mm.close()` explicitly in that case. Iterating over each byte using `mmap` consumes more memory than `file.read(1)`, but `mmap` is an order of magnitude faster.

If the file is not so big that holding it in memory is a problem:
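Something like:

```python
with open("filename", "rb") as f:
    bytes_read = f.read()  # slurp the whole file into memory
for b in bytes_read:
    process_byte(b)
```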
where `process_byte` represents some operation you want to perform on the passed-in byte.
If you want to process a chunk at a time:
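A sketch, with the chunk size chosen for illustration:

```python
CHUNKSIZE = 8192  # illustrative

with open("filename", "rb") as f:
    bytes_read = f.read(CHUNKSIZE)
    while bytes_read:
        for b in bytes_read:
            process_byte(b)
        bytes_read = f.read(CHUNKSIZE)
```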