Is it safe to mix readline() and line iterators in

Posted 2019-01-20 08:06

Question:

Is it safe to read some lines with readline() and also use for line in file, and is it guaranteed to use the same file position?

Usually, I want to disregard the first line (headers), so I do this:

FI = open("myfile.txt")
FI.readline()             # disregard the first line
for line in FI:
    my_process(line)
FI.close()

Is this safe, i.e., is it guaranteed that the same file position variable is used while iterating lines?

Answer 1:

No, it isn't safe:

"As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right." (Python 2 documentation, file.next())

You could use next() to skip the first line here. You should also test for StopIteration, which will be raised if the file is empty.

with open('myfile.txt') as f:
    try:
        header = next(f)  # skip the header line
    except StopIteration:
        print("File is empty")
    else:
        for line in f:
            pass  # do stuff with line

Answer 2:

This works out well in the long run. It ignores the fact that you're processing a file, and works with any sequence. Also, having the explicit iterator object (rdr) hanging around allows you to skip lines inside the body of the for loop without messing anything up.

with open("myfile.txt", "r") as source:
    rdr = iter(source)
    heading = next(rdr)
    for line in rdr:
        process(line)
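
To illustrate the point about skipping lines inside the loop body, here is a small sketch. The rule it applies (a line starting with "#" means "also drop the line that follows") is invented purely for the example, and io.StringIO stands in for the open file. It is written for Python 3.

```python
import io

# io.StringIO stands in for an open file; any iterable of lines would do.
source = io.StringIO("heading\n# comment\nskipped\nkept 1\nkept 2\n")

rdr = iter(source)
heading = next(rdr)

kept = []
for line in rdr:
    if line.startswith("#"):
        # Consume the line after the marker too. The for loop resumes
        # after it, because the loop and next() share the iterator rdr.
        next(rdr, None)
        continue
    kept.append(line)
```

Since the for loop and the explicit next() calls both pull from rdr, no line is seen twice and none is lost.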


Answer 3:

It is safe if the mechanisms are under control.

There is no problem iterating over the file after a readline() call.

But there is a problem calling readline() after an iteration.

I created a 'rara.txt' file with this text (each line has a length of 5 bytes because of the '\r\n' line ending on Windows):

1AA
2BB
3CC
4DD
5EE
6FF
7GG
8HH
9II
10j
11k
12l
13m
14n
15o

And I executed

FI  = open("rara.txt",'rb')
lineR = FI.readline()
print repr(lineR)+'   len=='+str(len(lineR))+\
      '  FI.tell() after FI.readline() : ',FI.tell(),'\n'

cnt = 0
for line in FI:
    cnt += 1
    print 'cnt=='+str(cnt)+'   '+repr(line)+'   len=='+str(len(line))+\
          "  FI.tell() after 'line in FI' : ",FI.tell()
    if cnt==4:
        break
print "\nFI.tell() after iteration 'for line in FI' : ",FI.tell(),'\n'


lineR = FI.readline()
print repr(lineR)+'   len=='+str(len(lineR))+\
      '  FI.tell() after FI.readline() : ',FI.tell()
lineR = FI.readline()
print repr(lineR)+'   len=='+str(len(lineR))+\
      '  FI.tell() after FI.readline() : ',FI.tell(),'\n'

for line in FI:
    print 'cnt=='+str(cnt)+'   '+repr(line)+'   len=='+str(len(line))+\
          "  FI.tell() after 'line in FI' : ",FI.tell()
print "\nFI.tell() after iteration 'for line in FI' : ",FI.tell(),'\n'

The result is

'1AA\r\n'   len==5  FI.tell() after FI.readline() :  5 

cnt==1   '2BB\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==2   '3CC\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==3   '4DD\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==4   '5EE\r\n'   len==5  FI.tell() after 'line in FI' :  75

FI.tell() after iteration 'for line in FI' :  75 


Traceback (most recent call last):
  File "E:\Python\NNN codes\esssssai.py", line 16, in <module>
    lineR = FI.readline()
ValueError: Mixing iteration and read methods would lose data


A strange thing is that if we reset the "cursor" with seek(), using a position obtained from tell(), readline() works again after an iteration (I don't know the behind-the-scenes mechanism of this "cursor" reset):

FI  = open("rara.txt",'rb')
lineR = FI.readline()
print repr(lineR)+'   len=='+str(len(lineR))+\
      '  FI.tell() after FI.readline() : ',FI.tell(),'\n'

cnt = 0
for line in FI:
    cnt += 1
    print 'cnt=='+str(cnt)+'   '+repr(line)+'   len=='+str(len(line))+\
          "  FI.tell() after 'line in FI' : ",FI.tell()
    if cnt==4:
        pos = FI.tell()
        break
print "\nFI.tell() after iteration 'for line in FI' : ",FI.tell(),'\n'

FI.seek(pos)

lineR = FI.readline()
print repr(lineR)+'   len=='+str(len(lineR))+\
      '  FI.tell() after FI.readline() : ',FI.tell()
lineR = FI.readline()
print repr(lineR)+'   len=='+str(len(lineR))+\
      '  FI.tell() after FI.readline() : ',FI.tell(),'\n'

for line in FI:
    print 'cnt=='+str(cnt)+'   '+repr(line)+'   len=='+str(len(line))+\
          "  FI.tell() after 'line in FI' : ",FI.tell()
print "\nFI.tell() after iteration 'for line in FI' : ",FI.tell(),'\n'

Result:

'1AA\r\n'   len==5  FI.tell() after FI.readline() :  5 

cnt==1   '2BB\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==2   '3CC\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==3   '4DD\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==4   '5EE\r\n'   len==5  FI.tell() after 'line in FI' :  75

FI.tell() after iteration 'for line in FI' :  75 

''   len==0  FI.tell() after FI.readline() :  75
''   len==0  FI.tell() after FI.readline() :  75 


FI.tell() after iteration 'for line in FI' :  75 

Anyway, note that even though the loop is written to read only 4 lines during the iteration (thanks to the counter cnt), the cursor is already at the end of the file from the very first step of the iteration: everything in the file ahead of the position where the iteration begins is read in one go. (The file is only 75 bytes, so it fits entirely in the read-ahead buffer.)

So pos = FI.tell() before the break doesn't give the position just after the 4 lines read, but the position of the end of the file.



We must do something special if we want to call readline() again after an iteration, starting from the exact point where the 4-line iteration stopped:

FI  = open("rara.txt",'rb')
lineR = FI.readline()
print repr(lineR)+'   len=='+str(len(lineR))+\
      '  FI.tell() after FI.readline() : ',FI.tell(),'\n'

cnt = 0
pos = FI.tell()
for line in FI:
    cnt += 1
    pos += len(line)
    print 'cnt=='+str(cnt)+'   '+repr(line)+'   len=='+str(len(line))+\
          "  FI.tell() after 'line in FI' : ",FI.tell()
    if cnt==4:
        break
print "\nFI.tell() after iteration 'for line in FI' : ",FI.tell()
print "    pos   after iteration 'for line in FI' : ",pos,'\n'

FI.seek(pos)

lineR = FI.readline()
print repr(lineR)+'   len=='+str(len(lineR))+\
      '  FI.tell() after FI.readline() : ',FI.tell()
lineR = FI.readline()
print repr(lineR)+'   len=='+str(len(lineR))+\
      '  FI.tell() after FI.readline() : ',FI.tell(),'\n'

cnt = 0
for line in FI:
    cnt += 1
    print 'cnt=='+str(cnt)+'   '+repr(line)+'   len=='+str(len(line))+\
          "  FI.tell() after 'line in FI' : ",FI.tell()
print "\nFI.tell() after iteration 'for line in FI' : ",FI.tell(),'\n'

Result:

'1AA\r\n'   len==5  FI.tell() after FI.readline() :  5 

cnt==1   '2BB\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==2   '3CC\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==3   '4DD\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==4   '5EE\r\n'   len==5  FI.tell() after 'line in FI' :  75

FI.tell() after iteration 'for line in FI' :  75
    pos   after iteration 'for line in FI' :  25 

'6FF\r\n'   len==5  FI.tell() after FI.readline() :  30
'7GG\r\n'   len==5  FI.tell() after FI.readline() :  35 

cnt==1   '8HH\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==2   '9II\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==3   '10j\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==4   '11k\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==5   '12l\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==6   '13m\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==7   '14n\r\n'   len==5  FI.tell() after 'line in FI' :  75
cnt==8   '15o\r\n'   len==5  FI.tell() after 'line in FI' :  75

FI.tell() after iteration 'for line in FI' :  75 
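
The manual bookkeeping above can be wrapped in a small function. This is only a sketch: read_n_lines is a hypothetical helper name, and io.BytesIO stands in for open("rara.txt", "rb") so that the example is self-contained. It is written for Python 3. Note that BytesIO has no read-ahead buffer, so the seek() is redundant in this demo. It shows the pattern rather than reproducing the Python 2 behaviour.

```python
import io


def read_n_lines(f, n, start):
    """Iterate over at most n lines of the binary file f.

    Returns (lines, offset), where offset is start plus the total length
    of the lines consumed, i.e. the byte position just after the last
    line read. Hypothetical helper, not part of any library.
    """
    lines = []
    offset = start
    for line in f:
        lines.append(line)
        offset += len(line)  # counts bytes, because f is in binary mode
        if len(lines) == n:
            break
    return lines, offset


# Nine 5-byte lines, similar in shape to rara.txt.
data = b"".join(("%dXX\r\n" % i).encode("ascii") for i in range(1, 10))
f = io.BytesIO(data)

first = f.readline()                        # line 1
lines, pos = read_n_lines(f, 4, f.tell())   # lines 2 to 5
f.seek(pos)                                 # resynchronise before readline()
after = f.readline()                        # line 6
```

The computed offset does not depend on what tell() reports during the iteration, which is the whole point of the technique.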


All these manipulations are possible only because the file was opened in binary mode. I am on Windows, which uses '\r\n' as the line ending when writing a file, even if it is told to write (in 'w' mode) something like 'abcdef\n',

while on the other hand Python converts (in 'r' mode) every '\r\n' to '\n'.

That's a mess, and to keep all this under control, files must be opened in 'rb' if we want to do precise position manipulations.
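
The length mismatch between text mode and binary mode can be reproduced on any platform with a short sketch. It assumes Python 3, where passing newline="\r\n" to open() imitates what text mode does by default on Windows. The temporary file is only there to keep the example self-contained.

```python
import os
import tempfile

# newline="\r\n" makes Python 3 write "\r\n" for every "\n" on any
# platform, imitating the default text-mode behaviour on Windows.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", newline="\r\n") as out:
        out.write("abcdef\n")

    with open(path, "r") as text_file:      # '\r\n' is translated to '\n'
        text_line = text_file.readline()

    with open(path, "rb") as binary_file:   # raw bytes, no translation
        raw_line = binary_file.readline()
finally:
    os.remove(path)

text_len = len(text_line)
raw_len = len(raw_line)
```

The text-mode line is one character shorter than the line actually stored on disk, so offsets computed from len(line) match real file positions only in binary mode.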



You know what? I love these games with file positions.