What is causing this garbage when writing to a file

Posted 2019-08-11 15:31

Question:

I am trying to figure out what is happening in this situation. I am on Windows 7 64-bit and I was experimenting with Unicode in Python.

With the following Python code

#aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
#aaaaaa

x = [u'\xa3']

f = open('file_garbage.txt', 'w+')
for s in x:
    if s in f.read():
        continue
    else:
        f.write(s.encode('utf-8'))
f.close()

I get no error message and file_garbage.txt contains

£

When I add another item to x, like so:

#aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
#aaaaaa

x = [u'\xa3',
     u'\xa3']

f = open('file_garbage.txt', 'w+')
for s in x:
    if s in f.read():
        continue
    else:
        f.write(s.encode('utf-8'))
f.close()

I get a UnicodeDecodeError

Traceback (most recent call last):
  File "file_garbage.py", line 9, in <module>
    if s in f.read():
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 2: ordinal not in range(128)

file_garbage.txt will contain either around 250 lines of bytes like this

c2a3 4b02 e0a6 5400 6161 6161 6161 6161
6161 6161 6161 6161 6161 6161 6161 6161
6161 6161 6161 6161 6161 610d 0a23 6161
6161 6161 0d0a 0d0a 7820 3d20 5b75 275c
7861 3327 2c0d 0a20 2020 2020 7527 5c78
6133 275d 0d0a 0d0a 6620 3d20 6f70 656e
2827 6669 6c65 5f67 6172 6261 6765 2e74

or garbage like this

£Kà¦éaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
#aaaaaa

x = [u'\xa3',
     u'\xa3']

f = open('file_garbage.txt', 'w+')
for s in x:
    if s in f.read():
        continue
    else:
        f.write(s.encode('utf-8'))
f.close()
 Python Character Mapping Codec cp1252 generated from 'MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT' with gencodec.py.

iÿÿÿÿNt

followed by a bunch of ENQ, DC2, SOH, STX, NUL symbols and links to:

 C:\Python27\lib\encodings\cp1252.py

[Image: pic of garbage]

I am guessing that this is a problem to do with encoding and/or the way I am dealing with files, but I am confused about what is happening exactly and why the results seem to differ.

The garbage seems to be generated only if those two seemingly random comment lines are at the top of the file; without them, the bytes are always generated instead.

If it helps, my system encodings are set as follows:

sys.stdout.encoding            :  cp850
sys.stdout.isatty()            :  True
locale.getpreferredencoding()  :  cp1252
sys.getfilesystemencoding()    :  mbcs

Answer 1:

It is possible that the file is being corrupted because it is not closed properly. I've never seen this particular behavior but it's within the realm of possibility. Try changing your code to use with:

with open('file_garbage.txt', 'w+') as f:
    # do your stuff here

This will ensure that the file is closed even if an exception is raised.
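As a small illustration of that guarantee, the sketch below raises an exception inside a with block and then checks the file object. The temporary path and the simulated ValueError are stand-ins invented for the example, not part of your script.

```python
import os
import tempfile

# Hypothetical scratch file so the demo does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

try:
    with open(path, 'w') as f:
        f.write('partial')
        # Simulated failure in the middle of the block.
        raise ValueError('simulated failure')
except ValueError:
    pass

# The with statement closed the file on the way out, despite the exception.
closed_after_error = f.closed
```

The exception still propagates out of the with block; only the cleanup is taken care of.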

The cause of the exception is that x contains unicode strings, but f.read() returns a byte string. When you evaluate s in f.read(), Python 2 implicitly decodes those bytes with the default ASCII codec so that it can compare them with the unicode string, and that fails as soon as it meets a non-ASCII byte (0xe0 in your traceback). You need to decode the contents of the file back into unicode explicitly, using the same encoding you wrote it with (UTF-8 here).
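A minimal sketch of the difference, using a byte string built in memory rather than your actual file. The variable names are only illustrative, and the explicit ASCII decode stands in for the implicit one Python 2 performs during the comparison.

```python
# Bytes as they sit in the file after f.write(s.encode('utf-8')).
raw = u'\xa3'.encode('utf-8')

# Roughly what Python 2 attempts behind the scenes for `s in f.read()`:
# an ASCII decode of the byte string, which cannot handle non-ASCII bytes.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding explicitly with the codec used for writing makes the test safe.
text = raw.decode('utf-8')
found = u'\xa3' in text
```

Here ascii_ok ends up False and found ends up True, so the membership test only works once both sides are unicode.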

Your code has a few other problems that are somewhat outside the scope of this question. For starters, using f.read() in a loop like that won't work, because the first read will read the whole file, and subsequent reads will return nothing. Instead, read (and decode) the file into a variable first, then do your comparison against that variable. Also, I'm not sure if reading and writing the file in w+ mode will do what you want. (I'm not actually sure what you want your code to do.) As documented, w+ truncates the file, so you won't be able to "update" it by adding to what's already there.
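Putting those points together, here is one possible rewrite, assuming the goal is "append each string to the file unless it is already there" (that is a guess at your intent). The helper name add_missing and the temporary path are made up for the example. It uses io.open, which is available in Python 2.6+ and decodes and encodes for you, reads the file once, and reopens it in append mode instead of w+. Using separate handles for reading and writing also avoids interleaving reads and writes on a single handle, which C stdio (on which Python 2 file objects are built) only permits with an intervening seek or flush.

```python
import io
import os
import tempfile


def add_missing(path, items):
    """Append each unicode item to the UTF-8 file at path unless present."""
    # Read and decode the whole file once; a missing file counts as empty.
    try:
        with io.open(path, 'r', encoding='utf-8') as f:
            existing = f.read()
    except IOError:
        existing = u''

    added = []
    # Append mode keeps what is already there; 'w+' would truncate it.
    with io.open(path, 'a', encoding='utf-8') as f:
        for s in items:
            if s in existing:
                continue
            f.write(s)       # io.open encodes to UTF-8 on write
            existing += s    # remember it so duplicates in items are skipped
            added.append(s)
    return added


# Usage on a scratch file (hypothetical path, created only for the demo).
demo_path = os.path.join(tempfile.mkdtemp(), 'file_garbage.txt')
first = add_missing(demo_path, [u'\xa3', u'\xa3'])
second = add_missing(demo_path, [u'\xa3', u'\u20ac'])
with io.open(demo_path, 'r', encoding='utf-8') as f:
    content = f.read()
```

The first call writes the pound sign once even though it appears twice in the list, and the second call adds only the euro sign.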