I am trying to figure out what is happening in this situation. I am on Windows 7 64-bit and I was experimenting with Unicode in Python.
With the following Python code
#aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
#aaaaaa
x = [u'\xa3']
f = open('file_garbage.txt', 'w+')
for s in x:
    if s in f.read():
        continue
    else:
        f.write(s.encode('utf-8'))
f.close()
I get no error message and file_garbage.txt contains
£
When I add another item to x, like so:
#aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
#aaaaaa
x = [u'\xa3',
     u'\xa3']
f = open('file_garbage.txt', 'w+')
for s in x:
    if s in f.read():
        continue
    else:
        f.write(s.encode('utf-8'))
f.close()
I get a UnicodeDecodeError:
Traceback (most recent call last):
  File "file_garbage.py", line 9, in <module>
    if s in f.read():
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 2: ordinal not in range(128)
After that, file_garbage.txt will contain either around 250 lines of bytes like this:
c2a3 4b02 e0a6 5400 6161 6161 6161 6161
6161 6161 6161 6161 6161 6161 6161 6161
6161 6161 6161 6161 6161 610d 0a23 6161
6161 6161 0d0a 0d0a 7820 3d20 5b75 275c
7861 3327 2c0d 0a20 2020 2020 7527 5c78
6133 275d 0d0a 0d0a 6620 3d20 6f70 656e
2827 6669 6c65 5f67 6172 6261 6765 2e74
or garbage like this:
£Kà¦éaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
#aaaaaa
x = [u'\xa3',
u'\xa3']
f = open('file_garbage.txt', 'w+')
for s in x:
if s in f.read():
continue
else:
f.write(s.encode('utf-8'))
f.close()
Python Character Mapping Codec cp1252 generated from 'MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT' with gencodec.py.
iÿÿÿÿNt
followed by a bunch of ENQ, DC2, SOH, STX and NUL control characters and links to:
C:\Python27\lib\encodings\cp1252.py
I am guessing that this is a problem to do with encoding and/or the way I am dealing with files, but I am confused about what is happening exactly and why the results seem to differ.
The garbage seems to be generated only when those seemingly random comment lines are present at the top of the script; without them, the bytes are always generated instead.
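The error message itself can be reproduced in isolation. In Python 2, testing a unicode string against a byte string with `in` implicitly decodes the bytes as ASCII. The sketch below performs that decode explicitly, on the assumption that the data returned by f.read() began with the bytes 4b 02 e0 that follow c2 a3 in the dump above. The names `raw` and `message` are only for illustration.

```python
# Bytes 4b 02 e0 from the hex dump above: 'K', 0x02, 0xe0.
# Assumption: this is how the string returned by f.read() started.
raw = b'K\x02\xe0'

try:
    # Python 2 does this decode implicitly for `u'\xa3' in raw`.
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

If that assumption holds, the failing byte and its position match the traceback, which would mean f.read() returned data that the script never wrote.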
If it helps, my system encodings are set as follows:
sys.stdout.encoding : cp850
sys.stdout.isatty() : True
locale.getpreferredencoding() : cp1252
sys.getfilesystemencoding() : mbcs
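For comparison, here is a minimal sketch of the same loop with the two suspected problem areas made explicit: the file is opened with io.open and a declared encoding, so reads return text rather than bytes, and the position is reset with seek() every time the code switches between reading and writing. Whether the missing seek is what actually produces the garbage is an assumption, not something confirmed here. `write_unique` is a hypothetical helper name, and a temporary directory stands in for the real location of file_garbage.txt.

```python
import io
import os
import tempfile


def write_unique(path, items):
    """Write each text item to path unless the file already contains it."""
    # io.open handles encoding and decoding, so f.read() returns text.
    with io.open(path, 'w+', encoding='utf-8') as f:
        for s in items:
            f.seek(0)                 # rewind before reading
            if s in f.read():
                continue
            f.seek(0, os.SEEK_END)    # reposition explicitly before writing
            f.write(s)


# Hypothetical location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'file_garbage.txt')
write_unique(path, [u'\xa3', u'\xa3'])

# Read the raw bytes back to see exactly what was stored.
with io.open(path, 'rb') as f:
    data = f.read()
print(repr(data))
```

With the two-item list, the second item is found on the read and skipped, so the file should hold a single UTF-8 encoded pound sign (the bytes c2 a3).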