Python file iterator over a binary file with newer idiom

Published 2019-01-11 09:57

Question:

In Python, for a binary file, I can write this:

buf_size = 1024 * 64           # this is an important size...
with open(file, "rb") as f:
    while True:
        data = f.read(buf_size)
        if not data:
            break
        # deal with the data....
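(On Python 3.8 and later, the assignment expression `:=` collapses this read-and-test loop into a single condition. A minimal sketch, using `io.BytesIO` as a stand-in for a real binary file:)

```python
import io

buf_size = 1024 * 64

# io.BytesIO stands in for a file opened with open(path, "rb").
with io.BytesIO(b"x" * (buf_size * 2 + 10)) as f:
    total = 0
    # Python 3.8+: read and test for EOF in one expression.
    while data := f.read(buf_size):
        total += len(data)

print(total)  # 131082
```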

With a text file that I want to read line-by-line, I can write this:

with open(file, "r") as file:
    for line in file:
        # deal with each line....

Which is shorthand for:

with open(file, "r") as file:
    for line in iter(file.readline, ""):
        # deal with each line....

This idiom is documented in PEP 234 but I have failed to locate a similar idiom for binary files.

I have tried this:

>>> with open('dups.txt','rb') as f:
...    for chunk in iter(f.read,''):
...       i+=1

>>> i
1                # 30 MB file, i==1 means read in one go...

I tried iter(f.read(buf_size),'') but that fails: the parentheses after the callable mean f.read is called immediately, so iter() receives the returned data instead of a callable.

I know I could write a function, but is there a way, with the default for chunk in file: idiom, to use a buffer size rather than line-oriented reads?

Thanks for putting up with the Python newbie trying to write his first non-trivial and idiomatic Python script.

Answer 1:

I don't know of any built-in way to do this, but a wrapper function is easy enough to write:

def read_in_chunks(infile, chunk_size=1024*64):
    while True:
        chunk = infile.read(chunk_size)
        if chunk:
            yield chunk
        else:
            # The chunk was empty, which means we're at the end
            # of the file
            return

Then at the interactive prompt:

>>> from chunks import read_in_chunks
>>> infile = open('quicklisp.lisp', 'rb')
>>> for chunk in read_in_chunks(infile):
...     print(chunk)
... 
<contents of quicklisp.lisp in chunks>

Of course, you can easily adapt this to use a with block:

with open('quicklisp.lisp', 'rb') as infile:
    for chunk in read_in_chunks(infile):
        print(chunk)

And you can eliminate the if statement like this:

def read_in_chunks(infile, chunk_size=1024*64):
    chunk = infile.read(chunk_size)
    while chunk:
        yield chunk
        chunk = infile.read(chunk_size)


Answer 2:

Try:

>>> with open('dups.txt', 'rb') as f:
...    for chunk in iter((lambda: f.read(how_many_bytes_you_want_each_time)), b''):
...       i += 1

In this two-argument form, iter needs a callable that takes zero arguments. Note also that a file opened in 'rb' mode returns bytes, so on Python 3 the sentinel must be b'' rather than ''.

  • a plain f.read would read the whole file, since the size argument is missing;
  • f.read(1024) would call the function immediately and pass its return value (the data read from the file) to iter, so iter would not receive a function at all;
  • (lambda: f.read(1234)) is a function that takes zero arguments (nothing between lambda and the colon) and calls f.read(1234).
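Put together, the pattern looks like this (a sketch using io.BytesIO as a stand-in file and b'' as the sentinel, since binary reads return bytes; the buffer size is illustrative):

```python
import io

chunk_size = 1024  # illustrative buffer size

# io.BytesIO stands in for a file opened with open(path, "rb").
with io.BytesIO(b"a" * 2500) as f:
    # iter() calls the lambda repeatedly until it returns b'' (EOF).
    chunks = list(iter(lambda: f.read(chunk_size), b""))

print([len(c) for c in chunks])  # [1024, 1024, 452]
```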

The following two forms are equivalent:

somefunction = (lambda: f.read(how_many_bytes_you_want_each_time))

and

def somefunction(): return f.read(how_many_bytes_you_want_each_time)

With either definition in place before your loop, you can simply write iter(somefunction, b'').

Technically you can skip the parentheses around the lambda; Python's grammar accepts that.
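As an alternative to the lambda, functools.partial can bind the size argument to f.read; some find this reads more explicitly. A sketch, again with io.BytesIO standing in for a real binary file:

```python
import io
from functools import partial

chunk_size = 1024  # illustrative buffer size

with io.BytesIO(b"b" * 3000) as f:
    # partial(f.read, chunk_size)() is equivalent to f.read(chunk_size).
    for i, chunk in enumerate(iter(partial(f.read, chunk_size), b"")):
        print(i, len(chunk))
# 0 1024
# 1 1024
# 2 952
```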