My files are in US-ASCII and a command like a = file( 'main.html')
and a.read()
loads them as an ASCII text. How do I get it to load as UTF8?
The problem I am tring to solve is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)
I was using the content of the files for templating as in template_str.format(attrib=val)
. But the string to interpolate is of a superset of ASCII.
Our team's version control and text editors does not care about the encoding. So how do I handle it in the code?
I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.
"How do I get it to load as UTF8?"
I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type
unicode
. In Python 3, the resulting data will be of typestr
.You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.
You are trying to opening files without specifying an encoding, which means that python uses the default value (ASCII).
You need to decode the byte-string explicitly, using the
.decode()
function:Your
val
variable you tried to interpolate into your template is itself a unicode value, and python wants to automatically convert your byte-string template (read from the file) into a unicode value too, so that it can combine both, and it'll use the default encoding to do so.Did I mention already you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.
A solution working in Python2: