How to open an ASCII-encoded file as UTF-8?

Posted 2019-07-04 23:16

My files are in US-ASCII, and code like a = file('main.html') followed by a.read() loads them as ASCII text. How do I get them to load as UTF-8?

The problem I am trying to solve is:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)

I was using the content of the files for templating, as in template_str.format(attrib=val). But the string to interpolate contains characters outside ASCII.

Our team's version control and text editors do not care about the encoding. So how do I handle it in the code?

3 answers
ら.Afraid
Post #2 · 2019-07-04 23:28

I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.

"How do I get it to load as UTF8?"

I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.

You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.
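As a minimal sketch of the decoding described above (the byte string `raw` is a made-up stand-in for what a.read() returns, not content from the question), decoding pure ASCII bytes with either codec should give the same text:

```python
# -*- coding: utf-8 -*-
# Hypothetical file content; in the question this would come from a.read().
raw = b'<p>Hello {attrib}</p>'

as_ascii = raw.decode('ascii')  # unicode in Python 2, str in Python 3
as_utf8 = raw.decode('utf-8')   # same result, since ASCII is a subset of UTF-8

same_text = (as_ascii == as_utf8)
# Decoding produced a text object, not a byte string.
is_text = not isinstance(as_ascii, bytes)
```

Decoding with 'ascii' has the side benefit of failing loudly if a file turns out to contain non-ASCII bytes after all.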

等我变得足够好
Post #3 · 2019-07-04 23:41

You are opening files without specifying an encoding, so Python 2 gives you undecoded byte strings and falls back to the default encoding (ASCII) whenever it has to convert them implicitly.

You need to decode the byte string explicitly, using the .decode() method:

 template_str = template_str.decode('utf8')

The val variable you tried to interpolate into your template is a unicode value, while the template read from the file is a byte string. To combine the two, Python 2 has to convert one into the other automatically, and it uses the default encoding (ASCII) to do so, which fails on u'\xae'.
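The fix can be sketched as follows. Here `template_bytes` and the value of `val` are illustrative stand-ins for the file content and the interpolated value, not data from the question:

```python
# -*- coding: utf-8 -*-
# Stand-in for the bytes read from the template file (pure ASCII).
template_bytes = b'<title>{attrib}</title>'
# Stand-in for the value to interpolate; it contains U+00AE (registered sign).
val = u'Acme\xae'

# Decode first so both operands are text; no implicit ASCII conversion is needed.
template_str = template_bytes.decode('utf8')
page = template_str.format(attrib=val)

# Encode back to bytes only at the output boundary (e.g. when writing a file).
page_bytes = page.encode('utf-8')
```

Keeping everything as text inside the program and encoding only on output is the usual way to avoid this class of error.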

Did I already mention that you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.

趁早两清
Post #4 · 2019-07-04 23:47

A solution that works in Python 2:

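import codecs

# codecs.open decodes while reading, so read() returns unicode, not bytes.
with codecs.open('filename.txt', 'r', 'ascii') as fo:
    content = fo.read()
assert type(content) == unicode

# Encode to UTF-8 when a byte string is needed (str in Python 2).
utf8_content = content.encode('utf-8')
assert type(utf8_content) == str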
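
If the code also has to run on Python 3, where the name unicode no longer exists, io.open is a possible alternative to codecs.open. This is only a sketch: the temporary file and its content are assumptions made so the example runs on its own.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Create a throwaway ASCII file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix='.html')
os.close(fd)
with io.open(path, 'w', encoding='ascii') as out:
    out.write(u'<p>{attrib}</p>')

# io.open decodes while reading; 'utf-8' also accepts pure ASCII files.
with io.open(path, 'r', encoding='utf-8') as fo:
    content = fo.read()  # unicode in Python 2, str in Python 3

os.remove(path)

# Encode only when a byte string is needed (str in Python 2, bytes in Python 3).
utf8_content = content.encode('utf-8')
```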