In Python, for a binary file, I can write this:
    buf_size = 1024 * 64  # this is an important size...
    with open(file, "rb") as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            # deal with the data....
With a text file that I want to read line-by-line, I can write this:
    with open(file, "r") as file:
        for line in file:
            # deal with each line....
Which is shorthand for:
    with open(file, "r") as file:
        for line in iter(file.readline, ""):
            # deal with each line....
This idiom is documented in PEP 234 but I have failed to locate a similar idiom for binary files.
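For reference, the equivalence of the two text loops can be checked with a small sketch. It assumes Python 3 and uses io.StringIO as an in-memory stand-in for a real file; the sample text is made up for the illustration.

```python
import io

# In-memory stand-in for a text file (illustration only).
text = "first\nsecond\nthird\n"

# Plain iteration over the file object.
direct = [line for line in io.StringIO(text)]

# Two-argument iter(): call readline() until it returns the sentinel "".
via_sentinel = [line for line in iter(io.StringIO(text).readline, "")]

assert direct == via_sentinel
```

Both loops produce the same lines, newline characters included.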
I have tried this:
    >>> i = 0
    >>> with open('dups.txt','rb') as f:
    ...     for chunk in iter(f.read,''):
    ...         i+=1
    ...
    >>> i
    1 # 30 MB file, i==1 means read in one go...
I tried iter(f.read(buf_size),''), but that raises a TypeError because of the parens after the callable in iter().
I know I could write a function, but is there a way to keep the default idiom of for chunk in file: while reading by buffer size rather than line by line?
Thanks for putting up with the Python newbie trying to write his first non-trivial and idiomatic Python script.
I don't know of any built-in way to do this, but a wrapper function is easy enough to write:
    def read_in_chunks(infile, chunk_size=1024*64):
        while True:
            chunk = infile.read(chunk_size)
            if chunk:
                yield chunk
            else:
                # The chunk was empty, which means we're at the end
                # of the file
                return
Then at the interactive prompt:
    >>> from chunks import read_in_chunks
    >>> infile = open('quicklisp.lisp')
    >>> for chunk in read_in_chunks(infile):
    ...     print(chunk)
    ...
    <contents of quicklisp.lisp in chunks>
Of course, you can easily adapt this to use a with block:
    with open('quicklisp.lisp') as infile:
        for chunk in read_in_chunks(infile):
            print(chunk)
And you can eliminate the if statement like this.
    def read_in_chunks(infile, chunk_size=1024*64):
        chunk = infile.read(chunk_size)
        while chunk:
            yield chunk
            chunk = infile.read(chunk_size)
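A quick way to see how the generator behaves is to feed it a small in-memory file. This is only a sketch: io.BytesIO stands in for a real binary file, and the 10-byte payload and chunk size of 4 are arbitrary values picked for the demo.

```python
import io


def read_in_chunks(infile, chunk_size=1024 * 64):
    """Yield successive chunks from infile until read() returns nothing."""
    chunk = infile.read(chunk_size)
    while chunk:
        yield chunk
        chunk = infile.read(chunk_size)


# 10 bytes read 4 at a time: two full chunks and one short final chunk.
data = io.BytesIO(b"0123456789")
sizes = [len(chunk) for chunk in read_in_chunks(data, chunk_size=4)]
print(sizes)  # [4, 4, 2]
```

The last chunk is shorter than chunk_size whenever the file length is not an exact multiple of it, so downstream code should not assume fixed-size chunks.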
Try:
    >>> with open('dups.txt','rb') as f:
    ...     for chunk in iter((lambda:f.read(how_many_bytes_you_want_each_time)),b''):
    ...         i+=1
iter needs a function with zero arguments.
- a plain f.read would read the whole file, since the size parameter is missing;
- f.read(1024) means call a function and pass its return value (data loaded from file) to iter, so iter does not get a function at all;
- (lambda:f.read(1234)) is a function that takes zero arguments (nothing between lambda and :) and calls f.read(1234).
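The three cases above can be tried side by side. This is a sketch assuming Python 3, where a file opened in binary mode returns bytes, so the sentinel has to be b'' rather than ''. io.BytesIO and the 10-byte payload are stand-ins for a real file.

```python
import io

payload = b"x" * 10

# 1. Plain f.read: the first call returns everything, so there is one chunk.
f = io.BytesIO(payload)
whole = list(iter(f.read, b""))

# 2. f.read(4) is called immediately; iter() receives bytes, not a callable.
f = io.BytesIO(payload)
try:
    iter(f.read(4), b"")
    raised = False
except TypeError:
    raised = True

# 3. A zero-argument lambda defers the call, so iter() can repeat it.
f = io.BytesIO(payload)
chunks = list(iter(lambda: f.read(4), b""))
print(len(whole), raised, [len(c) for c in chunks])  # 1 True [4, 4, 2]
```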
The following two definitions are equivalent:
    somefunction = (lambda:f.read(how_many_bytes_you_want_each_time))
and
    def somefunction(): return f.read(how_many_bytes_you_want_each_time)
With either of these defined before your code, you could just write: iter(somefunction, b'').
Technically you can skip the parentheses around the lambda; Python's grammar will accept that.
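Another way to build the zero-argument callable, not mentioned above, is functools.partial from the standard library. A minimal sketch, assuming Python 3; the in-memory file and chunk_size value are made up for the demo.

```python
import functools
import io

f = io.BytesIO(b"abcdefghij")  # stand-in for open(path, "rb")
chunk_size = 4  # hypothetical size for the demo

# partial(f.read, chunk_size) is a zero-argument callable, like the lambda.
read_chunk = functools.partial(f.read, chunk_size)
chunks = list(iter(read_chunk, b""))
print(chunks)  # [b'abcd', b'efgh', b'ij']
```

Whether partial or the lambda reads better is mostly a matter of taste; both hand iter() something it can call repeatedly until the sentinel comes back.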