Python in-place write to file at arbitrary position

Posted 2019-09-11 03:29

Question:

I'm trying to edit a text file in place in Python. It is very large, so loading it into memory is not an option. I intend to replace strings I find inside it byte for byte, with replacements of the same length.

with open("filename.txt", "r+b") as f:
    if f.read(8) == b"01234567":
        f.seek(-8, 1)  # step back over the 8 bytes just read
        f.write(b"87654321")

However, the write() operation adds onto the end of the file when I tried it:

>>> n.read()
'sdf'
>>> n.read(1)
''
>>> n.seek(0,0)
>>> n.read(1)
's'
>>> n.read(1)
'd'
>>> n.write("sdf")
>>> n.read(1)
''
>>> n.seek(0,0)
>>> n.read()
'sdfsdf'

I want the result of that to be sdsdf.

Answer 1:

The original ANSI / ISO C standards required a seek operation when switching a read-write mode stream from read mode to write mode, and vice versa. This restriction persists, e.g., n1570 includes this text:

When a file is opened with update mode ('+' as the second or third character in the above list of mode argument values), both input and output may be performed on the associated stream. However, output shall not be directly followed by input without an intervening call to the fflush function or to a file positioning function (fseek, fsetpos, or rewind), and input shall not be directly followed by output without an intervening call to a file positioning function, unless the input operation encounters end-of-file. Opening (or creating) a text file with update mode may instead open (or create) a binary stream in some implementations.

For whatever reason this restriction has been imported into Python,[1] even though it would be possible for the Python wrappers to handle it automatically.
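In Python terms, the quoted rule amounts to putting a file positioning call between a read and a write on the same file object. The following is a minimal sketch, assuming a throwaway temporary file that holds the same 'sdf' contents as the question's session. The call f.seek(0, 1) is a relative seek of zero bytes: it leaves the position where the read stopped and serves only as the required positioning call. Python 3's io layer appears to handle the switch on its own, so there the extra seek should be harmless rather than necessary.

```python
import os
import tempfile

# Scratch file containing b"sdf" (the path is a throwaway temp file).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"sdf")

with open(path, "r+b") as f:
    f.read(2)        # input: consumes b"sd"
    f.seek(0, 1)     # positioning call: move 0 bytes from the current position
    f.write(b"sdf")  # output: lands at offset 2
    f.seek(0)        # positioning call again before switching back to input
    result = f.read()

os.remove(path)
print(result)
```

The file ends up holding sdsdf, which is the result the question asks for.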

For what it's worth, the reason for the original ANSI C restriction was the low-budget implementation found on many Unix-based systems: they kept, for each stream, a "current byte count" and a "current pointer". The current byte count was 0 if the macro-ized getc and putc operations had to call into the underlying implementation, which could check whether a stream was opened in update mode and switch it as needed. But once you successfully obtained a character, the counter would hold the number of characters that could continue to be read from the underlying stream; and once you successfully wrote a character, the counter would hold the number of buffer locations still available for adding characters.

This meant that if you did a successful getc that filled an internal buffer, but followed it with a putc, the "written" character from putc would simply overwrite the buffered data. If you had a successful putc but followed it with a poorly implemented getc, you would see unset values from the buffer.

This problem was trivial to fix (just provide separate input and output counters, one of which is always zero, and have the functions that implement buffer-refill check for mode-switch as well).
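The getc-then-putc failure described above can be modeled in a few lines. This is only a toy sketch: ToyStream and all of its attributes are hypothetical names invented for illustration, not any real stdio implementation, and the slow path for writes is deliberately left unmodeled.

```python
class ToyStream:
    """Toy model: one buffer, one pointer, one counter shared by reads and writes."""

    def __init__(self, disk, bufsize=4):
        self.disk = bytearray(disk)  # stands in for the file contents
        self.disk_pos = 0            # position of the underlying file
        self.bufsize = bufsize
        self.buf = bytearray()
        self.ptr = 0
        self.count = 0               # bytes left to read OR slots left to write

    def getc(self):
        if self.count == 0:          # slow path: refill the buffer from "disk"
            end = self.disk_pos + self.bufsize
            self.buf = bytearray(self.disk[self.disk_pos:end])
            self.disk_pos += len(self.buf)
            self.ptr = 0
            self.count = len(self.buf)
            if self.count == 0:
                return None          # end of file
        c = self.buf[self.ptr]
        self.ptr += 1
        self.count -= 1
        return c

    def putc(self, c):
        if self.count == 0:
            raise NotImplementedError("slow path (mode switch) is not modeled")
        # Fast path: the counter is nonzero, so the byte goes straight into
        # the buffer, even though that buffer currently holds read data.
        self.buf[self.ptr] = c
        self.ptr += 1
        self.count -= 1


s = ToyStream(b"abcdef")
first = s.getc()      # fills the buffer with b"abcd" and returns ord("a")
s.putc(ord("X"))      # overwrites the buffered "b"; nothing reaches the disk
after = s.getc()      # returns ord("c") from the same buffer
print(bytes(s.buf), bytes(s.disk))
```

After the putc, the buffer holds aXcd while the simulated disk still holds abcdef: the written byte clobbered buffered input and never reached the file.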


[1] Citation needed :-)



Answer 2:

Compare the following two sessions:

>>> f = open("file.txt", "r+b")
>>> f.seek(2)
>>> f.write("sdf")
>>> f.seek(0)
>>> f.read()
'sdsdf'


>>> f = open("file.txt", "r+b")
>>> f.read(1)
's'
>>> f.read(1)
'd'
>>> f.write("sdf")
>>> f.seek(0)
>>> f.read()
'sdfsdf'

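In the second session, the .write() that follows the two .read() calls lands at the end of the file rather than at offset 2, where the reads stopped. A .read() on its own does not leave the stream ready for output at that position; only an explicit .seek() does. So you have to call .seek() before writing the bytes. The following code works: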

>>> f = open("file.txt", "r+b")
>>> f.read(1)
's'
>>> f.read(1)
'd'
>>> f.seek(2)
>>> f.write("sdf")
>>> f.seek(0)
>>> f.read()
'sdsdf'
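Applied to the original problem (a file too large to load into memory), the same seek-before-write pattern can be wrapped in a chunked scan. The sketch below is one possible approach, not a definitive implementation: replace_in_place, its parameters, and the default chunk size are hypothetical choices made for illustration. It assumes old and new are byte strings of equal length, and it keeps the last len(old) - 1 bytes of each window so that matches spanning a chunk boundary are still found.

```python
import os
import tempfile


def replace_in_place(path, old, new, chunk_size=1 << 20):
    """Overwrite each non-overlapping occurrence of old with new; return the count."""
    if not old or len(old) != len(new):
        raise ValueError("old and new must be non-empty and of equal length")
    overlap = len(old) - 1
    count = 0
    last_end = 0   # file offset just past the most recent match
    read_pos = 0   # file offset where the next chunk starts
    carry = b""    # original trailing bytes of the previous window
    with open(path, "r+b") as f:
        while True:
            f.seek(read_pos)              # reposition before every read
            chunk = f.read(chunk_size)
            if not chunk:
                break
            window = carry + chunk
            window_start = read_pos - len(carry)
            # Skip bytes that belonged to an earlier match.
            i = window.find(old, max(0, last_end - window_start))
            while i != -1:
                f.seek(window_start + i)  # reposition before every write
                f.write(new)
                count += 1
                last_end = window_start + i + len(old)
                i = window.find(old, i + len(old))
            read_pos += len(chunk)
            carry = window[-overlap:] if overlap else b""
    return count


# Demo on a throwaway temp file, with a tiny chunk size to force boundary cases.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"xx01234567yy01234567")
count = replace_in_place(path, b"01234567", b"87654321", chunk_size=5)
with open(path, "rb") as f:
    result = f.read()
os.remove(path)
print(count, result)
```

Memory use is bounded by chunk_size plus the length of old, and every switch between reading and writing goes through an explicit seek(), as both answers recommend.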