How do I convert a file's format from Unicode

2019-01-13 13:32发布

I use a 3rd party tool that outputs a file in Unicode format. However, I prefer it to be in ASCII. The tool does not have settings to change the file format.

What is the best way to convert the entire file format using Python?

8条回答
男人必须洒脱
2楼-- · 2019-01-13 14:14

It's important to note that there is no 'Unicode' file format. Unicode can be encoded to bytes in several different ways. Most commonly UTF-8 or UTF-16. You'll need to know which one your 3rd-party tool is outputting. Once you know that, converting between different encodings is pretty easy:

in_file = open("myfile.txt", "rb")
out_file = open("mynewfile.txt", "wb")

in_byte_string = in_file.read()
unicode_string = bytestring.decode('UTF-16')
out_byte_string = unicode_string.encode('ASCII')

out_file.write(out_byte_string)
out_file.close()

As noted in the other replies, you're probably going to want to supply an error handler to the encode method. Using 'replace' as the error handler is simple, but will mangle your text if it contains characters that cannot be represented in ASCII.

查看更多
Explosion°爆炸
3楼-- · 2019-01-13 14:22

You can convert the file easily enough just using the unicode function, but you'll run into problems with Unicode characters without a straight ASCII equivalent.

This blog recommends the unicodedata module, which seems to take care of roughly converting characters without direct corresponding ASCII values, e.g.

>>> title = u"Klüft skräms inför på fédéral électoral große"

is typically converted to

Klft skrms infr p fdral lectoral groe

which is pretty wrong. However, using the unicodedata module, the result can be much closer to the original text:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
查看更多
登录 后发表回答