I am trying to extract a number of locations from an existing file using Python. This is my current code for extracting the locations:
    self.fh = open( fileName , "r+")
    p = re.compile('regGen regPorSnip begin')
    for line in self.fh :
        if ( p.search(line) ):
            self.porSnipStartFPtr = self.fh.tell()
            sys.stdout.write("found regPorSnip")
This snippet is repeated a number of times (less the file open) with different search values, and seems to work: I get the correct messages, and the variables have values.
However, using the code below, the first write location is wrong, while subsequent write locations are correct:
    self.fh.seek(self.rstSnipStartFPtr,0)
    self.fh.write(str)
    sys.stdout.write("writing %s" % str )
    self.rstSnipStartFPtr = self.fh.tell()
I have read that passing certain read/readline options to fh can cause an erroneous tell value because of Python's tendency to 'read ahead'. One suggestion I saw for avoiding this is to read the whole file and rewrite it, which isn't a very appealing solution in my application.
If I change the first code snippet to:
    for line in self.fh.read() :
        if ( p.search(line) ):
            self.porSnipStartFPtr = self.fh.tell()
            sys.stdout.write("found regPorSnip")
Then it appears that self.fh.read() is returning only characters and not an entire line. The search never matches. The same appears to hold true for self.fh.readline().
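The character-by-character behaviour can be reproduced in isolation. This is a minimal sketch, assuming an in-memory io.StringIO as a stand-in for the real file and made-up sample contents: read() and readline() each return a single string, and a for loop over a string walks it one character at a time, so the multi-character pattern can never match.

```python
import io
import re

p = re.compile("regGen regPorSnip begin")
# Hypothetical file contents; io.StringIO stands in for the opened file.
fh = io.StringIO("header\nregGen regPorSnip begin\nbody\n")

# read() returns the whole file as ONE string; iterating it yields characters.
pieces_from_read = list(fh.read())

fh.seek(0)
# readline() returns ONE line as a string; iterating it also yields characters.
pieces_from_readline = list(fh.readline())

# No single character can match the multi-character pattern.
any_match = any(p.search(piece) for piece in pieces_from_read)
```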
My conclusion is that fh.tell only returns valid file locations when queried after a write operation.
Is there a way to extract the accurate file location when reading/searching?
Thanks.
The cause is (rather obscurely) explained in the docs for a file object's next() method: to make looping over the lines of a file efficient, next() uses a hidden read-ahead buffer. The values returned by tell() reflect how far this hidden read-ahead buffer has gotten, which will typically be up to a few thousand bytes beyond the characters your program has actually retrieved.
There's no portable way around this. If you need to mix tell() with reading lines, then use the file's readline() method instead. The tradeoff is that, in return for getting usable tell() results, iterating over a large file with readline() is typically significantly slower than using a plain for line in file_object: loop.
Code
Concretely, change the loop to this:
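A minimal sketch of the readline()-based loop, written as a standalone function so it runs on its own. The function name find_offset_after_match and the sample file contents are hypothetical; in the question's code the same loop body would assign to self.porSnipStartFPtr. The two-argument iter(fh.readline, "") form calls readline() until it returns an empty string, so the read-ahead buffer used by line iteration never comes into play.

```python
import os
import re
import tempfile


def find_offset_after_match(path, pattern):
    """Return tell() right after the first matching line, or None."""
    regex = re.compile(pattern)
    # newline="" disables newline translation so offsets match the bytes on disk.
    with open(path, "r+", encoding="utf-8", newline="") as fh:
        # iter(callable, sentinel) keeps calling fh.readline() until it returns "".
        for line in iter(fh.readline, ""):
            if regex.search(line):
                return fh.tell()
    return None


# Hypothetical sample file: the marker is on the second line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write("header\nregGen regPorSnip begin\nbody\n")
    sample_path = tmp.name

offset_after = find_offset_after_match(sample_path, "regGen regPorSnip begin")
os.remove(sample_path)
```

With this sample file the returned offset points at the first character of the line after the marker, not at the marker line itself.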
I'm not sure that's what you really want, though: tell() is capturing the position of the start of the next line. If you want the position of the start of the matching line, then you need to change the logic so that the position is recorded before each line is read, either by carrying it through the loop or with a "loop and a half".
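Both variants can be sketched as follows. This is an illustration under assumptions rather than a drop-in replacement: the function names and the sample file are hypothetical, and in the question's code the returned value would be stored in self.porSnipStartFPtr.

```python
import os
import re
import tempfile


def start_offset_plain_loop(path, pattern):
    """Carry the position of the current line through the loop."""
    regex = re.compile(pattern)
    with open(path, "r+", encoding="utf-8", newline="") as fh:
        pos = fh.tell()  # start of the line about to be read
        for line in iter(fh.readline, ""):
            if regex.search(line):
                return pos
            pos = fh.tell()  # start of the next line
    return None


def start_offset_loop_and_a_half(path, pattern):
    """Record the position, read a line, and exit from the middle of the loop."""
    regex = re.compile(pattern)
    with open(path, "r+", encoding="utf-8", newline="") as fh:
        while True:
            pos = fh.tell()
            line = fh.readline()
            if not line:  # empty string means end of file
                return None
            if regex.search(line):
                return pos


# Hypothetical sample file: the marker is on the second line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write("header\nregGen regPorSnip begin\nbody\n")
    sample_path = tmp.name

start_a = start_offset_plain_loop(sample_path, "regGen regPorSnip begin")
start_b = start_offset_loop_and_a_half(sample_path, "regGen regPorSnip begin")
os.remove(sample_path)
```

Both functions return the same offset, and a later seek to that offset lands on the first character of the matching line.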
I guess I don't understand the issue.