I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it?
in.txt:
Stäckövérfløw
code.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print """Content-Type: text/plain; charset="UTF-8"\n"""
f = open('in.txt','r')
for line in f:
    print line
    for i in line:
        print i,
f.close()
output:
Stäckövérfløw
S t � � c k � � v � � r f l � � w
When you read the file, the string you read in is a string of bytes. The for loop iterates over a single byte at a time. This causes problems with a UTF-8 encoded string, where non-ASCII characters are represented by multiple bytes. If you want to work with Unicode objects, where the characters are the basic pieces, you should decode the file as you read it, for example by opening it with codecs.open and an explicit encoding.
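A minimal sketch of reading the file as Unicode, assuming codecs.open is the intended tool and the file really is UTF-8. The helper name read_characters and the temporary file standing in for in.txt are made up for illustration. It avoids print so that it runs unchanged on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def read_characters(path, encoding='utf-8'):
    """Return the characters of a text file, decoded with `encoding`."""
    # codecs.open decodes while reading, so each line is a Unicode string
    # and iterating over it yields whole characters, not single bytes.
    f = codecs.open(path, 'r', encoding)
    try:
        return [ch for line in f for ch in line.rstrip(u'\n')]
    finally:
        f.close()


# Build a throwaway UTF-8 file standing in for in.txt (hypothetical data).
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.write(fd, u'St\xe4ck\xf6v\xe9rfl\xf8w\n'.encode('utf-8'))
os.close(fd)

chars = read_characters(sample_path)
os.remove(sample_path)
```

Here chars holds 13 items, one per character, even though the file contains 17 bytes of text.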
If sys.stdout doesn't already have the appropriate encoding set, you may have to wrap it.
Use codecs.open instead, it works for me.
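For the output side, wrapping a byte stream so that it accepts Unicode might look like the sketch below. It assumes codecs.getwriter is an acceptable way to do the wrapping, and it uses an io.BytesIO buffer (Python 2.6 or later) as a stand-in for the real stream so the result can be inspected. On Python 2 the stream being wrapped would be sys.stdout itself.

```python
import codecs
import io

# Stand-in for a byte-oriented output stream. In Python 2 this role is
# played by sys.stdout; in Python 3 it would be sys.stdout.buffer.
byte_stream = io.BytesIO()

# codecs.getwriter returns a StreamWriter class. Instantiating it around a
# byte stream gives an object that encodes Unicode text on every write.
utf8_out = codecs.getwriter('utf-8')(byte_stream)
utf8_out.write(u'St\xe4ck\xf6v\xe9rfl\xf8w\n')

# The bytes that actually reached the underlying stream.
encoded = byte_stream.getvalue()
```

The caller only ever hands Unicode strings to utf8_out, and the encoding to UTF-8 happens in one place.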
One may want to just use
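One simple option, offered here as an assumption, is to keep the plain open() call and decode each line before looping over it. The sketch below shows only the decoding step. The helper name characters_in is hypothetical, and on Python 3 the file would have to be opened in 'rb' mode for the lines to be bytes.

```python
def characters_in(raw_line, encoding='utf-8'):
    """Decode one raw line (bytes) and return its characters as a list."""
    # decode() turns the byte string into a Unicode string, so list()
    # splits it into whole characters instead of single bytes.
    return list(raw_line.decode(encoding))


# The bytes a plain open('in.txt', 'r') would hand back in Python 2.
raw_line = u'St\xe4ck\xf6v\xe9rfl\xf8w'.encode('utf-8')
pieces = characters_in(raw_line)
```

In the original loop this would amount to iterating over line.decode('utf-8') instead of line.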
Check this out:
It returns this:
Stäckövérfløw
'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
S t ? ? c k ? ? v ? ? r f l ? ? w
The thing is that the file is just being read as a string of bytes. Iterating over them splits the multibyte characters into nonsensical byte values.
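To make the splitting concrete, here is a small sketch that cuts the UTF-8 bytes into single-byte pieces and checks which pieces are still valid on their own. The helper name is_valid_utf8 is made up for illustration. The slicing is written so it behaves the same on Python 2 and Python 3.

```python
raw = u'St\xe4ck\xf6v\xe9rfl\xf8w'.encode('utf-8')

# One-byte slices, which is what the inner for loop sees in Python 2.
single_bytes = [raw[i:i + 1] for i in range(len(raw))]


def is_valid_utf8(chunk):
    """Return True if `chunk` decodes as UTF-8 on its own."""
    try:
        chunk.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# The halves of the two-byte characters cannot be decoded individually.
broken = [b for b in single_bytes if not is_valid_utf8(b)]
```

The 13 characters occupy 17 bytes, and the 8 bytes belonging to the four accented characters end up in broken, which matches the pairs of unprintable symbols in the output.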
print i, (with the trailing comma) adds a blank character after every byte, which breaks correct UTF-8 sequences into incorrect ones. So this will not work unless you write each single byte to the output with nothing in between.
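A sketch of why the separator matters, assuming Python 2.6 or later (or Python 3) for the b'' literals. It compares the bytes joined with a space, which is roughly what the trailing-comma print emits, against the same bytes written back to back. The helper name decodes_as_utf8 is hypothetical.

```python
raw = u'St\xe4ck\xf6v\xe9rfl\xf8w'.encode('utf-8')
single_bytes = [raw[i:i + 1] for i in range(len(raw))]

# A space after each byte lands between the two halves of every
# multibyte character.
spaced = b' '.join(single_bytes)

# Bytes written one after another with no separator.
contiguous = b''.join(single_bytes)


def decodes_as_utf8(data):
    """Return True if `data` is a valid UTF-8 byte sequence."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


spaced_ok = decodes_as_utf8(spaced)
contiguous_ok = decodes_as_utf8(contiguous)
```

Only the contiguous version is still valid UTF-8. In Python 2, sys.stdout.write(i) in place of print i, would write the bytes without the extra blank.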