How do I convert a file's format from Unicode

I use a 3rd party tool that outputs a file in Unicode format. However, I prefer it to be in ASCII. The tool does not have settings to change the file format.

What is the best way to convert the entire file format using Python?

Tags: python unicode encoding file ascii

8 answers

不美不萌又怎样

#2 · 2019-01-13 14:00

Here's some simple (and stupid) code to do encoding translation. I'm assuming (but you shouldn't) that the input file is in UTF-16 (Windows calls this simply 'Unicode').

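input_codec = 'UTF-16'
output_codec = 'ASCII'

# Read the raw bytes and decode them to text.
unicode_file = open('filename', 'rb')
unicode_data = unicode_file.read().decode(input_codec)
unicode_file.close()

# Encode the text as ASCII and write the bytes out.
ascii_file = open('new filename', 'wb')
ascii_file.write(unicode_data.encode(output_codec))
ascii_file.close()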

Note that this will not work if there are any characters in the Unicode file that are not also ASCII characters. You can do the following to turn unrecognized characters into '?'s:

ascii_file.write(unicode_data.encode(output_codec, 'replace'))

Check out the docs for more simple choices. If you need to do anything more sophisticated, you may wish to check out The UNICODE Hammer at the Python Cookbook.
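For anyone on Python 3, here is a minimal sketch of the same idea. It assumes (as above) that the input really is UTF-16, and the helper name `convert_file` and the temporary file names are made up for the demonstration. In Python 3, `open()` takes `encoding` and `errors` arguments directly, so the explicit decode/encode calls go away.

```python
import os
import tempfile


def convert_file(src_path, dst_path, input_codec="utf-16",
                 output_codec="ascii", errors="strict"):
    """Re-encode a text file; `errors` is handed to the output encoder."""
    # newline="" leaves line endings exactly as they are in the source.
    with open(src_path, "r", encoding=input_codec, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=output_codec,
              errors=errors, newline="") as dst:
        dst.write(text)


# Demonstration on a throwaway UTF-16 file containing one non-ASCII letter.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.txt")
dst = os.path.join(workdir, "out.txt")
with open(src, "w", encoding="utf-16", newline="") as f:
    f.write("caf\u00e9 menu\n")

convert_file(src, dst, errors="replace")

with open(dst, "rb") as f:
    result = f.read()
print(result)  # b'caf? menu\n'
```

With the default `errors="strict"` the same call raises `UnicodeEncodeError` on the accented letter instead of writing a `?`.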


我只想做你的唯一

#3 · 2019-01-13 14:04

I think this is a deeper issue than you realize. Simply changing the file from Unicode to ASCII is easy. Getting all of the Unicode characters to translate into reasonable ASCII counterparts is another matter, because many letters are not available in both encodings.

This Python Unicode tutorial may give you a better idea of what happens to Unicode strings that are translated to ASCII: http://www.reportlab.com/i18n/python_unicode_tutorial.html

Here's a useful quote from the site:

Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding:

> >>> unicode('hello')
> u'hello'
> >>> unicode('hello', 'ascii')
> u'hello'
> >>> unicode('hello', 'iso-8859-1')
> u'hello'
> >>>

All three of these return the same thing, since the characters in 'Hello' are common to all three encodings.

Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.

> >>> a = unicode('André','latin-1')
> >>> a
> u'Andr\202'

If you can't type an acute letter e, you can enter the string 'Andr\202', which is unambiguous.

Unicode supports all the common operations such as iteration and splitting. We won't run over them here.
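The tutorial quoted above dates from Python 1.6. As a rough Python 3 illustration of what happens when a string like 'André' is pushed into ASCII, the sketch below tries each of the standard error handlers for `str.encode`. The helper name `try_strict` is my own, just for the demo.

```python
name = "Andr\u00e9"  # 'André', written with an escape so the source stays ASCII


def try_strict(text):
    """Return the ASCII bytes, or the exception if the text does not fit."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc


strict_result = try_strict(name)                  # a UnicodeEncodeError instance
replaced = name.encode("ascii", "replace")        # b'Andr?'
ignored = name.encode("ascii", "ignore")          # b'Andr'
escaped = name.encode("ascii", "backslashreplace")   # b'Andr\\xe9'
xml_ref = name.encode("ascii", "xmlcharrefreplace")  # b'Andr&#233;'

for label, value in [("strict", strict_result), ("replace", replaced),
                     ("ignore", ignored), ("backslashreplace", escaped),
                     ("xmlcharrefreplace", xml_ref)]:
    print(label, repr(value))
```

None of these gets you 'Andre'. Every handler either fails, drops the letter, or substitutes a placeholder, which is the "deeper issue" I mean.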


smile是对你的礼貌

#4 · 2019-01-13 14:06

Like this:

uc = open(filename, 'rb').read().decode('utf8')
ascii = uc.encode('ascii')

Note, however, that this will fail with a UnicodeEncodeError exception if there are any characters that can't be converted to ASCII.

EDIT: As Pete Karl just pointed out, there is no one-to-one mapping from Unicode to ASCII. So some characters simply can't be converted in an information-preserving way. Moreover, standard ASCII is more or less a subset of UTF-8, so you don't really even need to do any decoding.
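To make the subset point concrete, here is a small Python 3 sketch. It checks whether a chunk of bytes is already pure ASCII, in which case a UTF-8 file needs no conversion at all. The function name `is_pure_ascii` is only illustrative.

```python
def is_pure_ascii(data: bytes) -> bool:
    """True if every byte in `data` is a valid ASCII character."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


utf8_plain = "hello".encode("utf-8")
utf8_accented = "Andr\u00e9".encode("utf-8")

# ASCII-only text produces identical bytes in both encodings.
print(utf8_plain == "hello".encode("ascii"))
print(is_pure_ascii(utf8_plain))
print(is_pure_ascii(utf8_accented))
```

Note that this only holds for UTF-8. A UTF-16 file is never byte-compatible with ASCII, even if it contains nothing but ASCII characters.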


Luminary・发光体

#5 · 2019-01-13 14:07

In my case I just wanted to skip the non-ASCII characters and output only ASCII. The solution below worked really well:

    import unicodedata
    # Read raw bytes and decode; 'text' avoids shadowing the built-in input().
    text = open(filename, 'rb').read().decode('UTF-16')
    output = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore')
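A short Python 3 sketch of what the NFKD step actually does, in case it helps. The function name `strip_to_ascii` is hypothetical. NFKD splits an accented letter into its base letter plus a combining mark, so `'ignore'` then drops only the mark and the base letter survives. Characters with no decomposition are dropped entirely.

```python
import unicodedata


def strip_to_ascii(text):
    """Decompose accented characters, then drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore")


accented = strip_to_ascii("Andr\u00e9")    # 'André'  -> b'Andre'
no_decomp = strip_to_ascii("stra\u00dfe")  # 'straße' -> b'strae'
print(accented, no_decomp)
```

So the result is lossy in two different ways: accents are silently removed, and letters such as 'ß' vanish without a replacement.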


beautiful°

#6 · 2019-01-13 14:13

As other posters have noted, ASCII is a subset of Unicode.

However, if all of the following apply:

you have a legacy app
you don't control the code for that app
you're sure your input falls into the ASCII subset

then the example below shows how to do it:

mystring = u'bar'
type(mystring)
    <type 'unicode'>

myasciistring = (mystring.encode('ASCII'))
type(myasciistring)
    <type 'str'>


老娘就宠你

#7 · 2019-01-13 14:14

By the way, there is a Linux command, iconv, that does this kind of job.

iconv -f utf8 -t ascii <input.txt >output.txt
