Python: writing a file and dealing with encoding

Posted 2020-06-29 06:19

Question:

I'm confused and I need help! I'm dealing with a file that contains Chinese characters. For instance, let's call it a.TEST; here is what's inside:

你好 中国 Hello China 1 2 3

You don't need to understand what the Chinese means. (Actually it's 'hello China'.)

>>> f=open('a.TEST')
>>> print f.read()
你好 中国 Hello China 1 2 3

>>> f.seek(0)
>>> content = f.readline()
>>> content
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>> print content
你好 中国 Hello China 1 2 3
>>> type(content)
<type 'str'>
>>> isinstance(content,unicode)
False

Here comes the first question: why does the Python shell show me the UTF-8 bytes of content when I just type content, while the print content command outputs the form that I want to see?

The second question: what's the difference between unicode and str? Someone told me that encode converts unicode to str, but what I learned from the Unicode HOWTO tells me that encode converts unicode to UTF-8.

Not over yet! :)

here is test.py

#!/usr/bin/python
#-*- coding: utf-8 -*-

fr = open('a.TEST')
fw = open('out.TEST','w')

content = fr.readline()
content_list = content.split()
print content
fw.write('{0}'.format(content_list))

fr.close()
fw.close()

Third question: why do the Chinese characters turn into UTF-8 codes when I do .split()?

And I thought fw.write('{0}'.format(content_list).decode('utf-8')) would work, but it doesn't. I don't want what's written into out.TEST to be in encoded form; I want it to be exactly the characters as they looked originally (你好). How do I do that?

Answer 1:

What is Encoding

A file consists of bytes. You can represent each byte with a number between 0 and 255 (or 0x00 and 0xFF in hexadecimal).

Text is also written as bytes, and there is an agreement on the way text is written. That agreement is an encoding. The most basic encoding is ASCII, and other encodings are usually based on it. For example, ASCII defines that number 65 (0x41) represents 'A', 66 (0x42) represents 'B', and so on.
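
Python's built-ins ord and chr expose that agreement directly (a quick check):

>>> ord('A'), ord('B')
(65, 66)
>>> chr(65)
'A'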

How are Strings Represented

In python, you can define a string using numeric values:

>>> '\x41\x42\x43'
'ABC'

'\x41\x42\x43' is exactly the same thing as 'ABC'. Python will always represent the string using the more readable textual representation ('ABC').

However, some numeric values are not printable characters, so they will be represented in numeric form:

>>> '\x00\x01\x02\x03\x04'
'\x00\x01\x02\x03\x04'

Other characters have aliases to make your job easier:

>>> '\x0a\x0d\x09'
'\n\r\t'

Different Encodings

The ASCII table defines the meaning of numbers 0-127 and covers only the English alphabet. Numbers 128-255 are undefined, so other encodings define a meaning for 128-255, and yet others change the meaning of the whole range 0-255.

There are many encodings, and they define 128-255 differently.

For example, character 185 (0xB9) is ą in windows-1250 encoding, but it is š in iso-8859-2 encoding.
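
You can see this from Python 2 by decoding the same byte with each encoding (a quick check):

>>> '\xb9'.decode('windows-1250')   # ą
u'\u0105'
>>> '\xb9'.decode('iso-8859-2')     # š
u'\u0161'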

So, what happens if you print \xb9? It depends on the encoding used in the console. In my case (my console uses cp852 encoding) it is:

>>> print '\xb9'
╣

Because of that ambiguity, the string '\xb9' will never be represented as '╣' (nor as 'ą'...), because that would hide its true value. It will be represented as the numeric value:

>>> '\xb9'
'\xb9'

Also:

>>> '╣'
'\xb9'

See also the string from the question in my console:

>>> content = '\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>>
>>> content
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>>
>>> print content
ńŻáňąŻ ńŞşňŤŻ Hello China 1 2 3

But what happens if a variable is just entered in the console?

When a variable is entered in the console without print, its representation is printed. It is the same as the following:

>>> print repr(content)
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'

What is Unicode?

The Unicode table aims to define a numeric representation of all characters in the world, and more. It can actually do that, because it is not limited to 256 values (or to any other limit, actually). This is not an encoding, but a universal mapping of numbers to characters.

For example, Unicode defines that number 353 (0x0161) is the character š. That is always true, regardless of your locale and the encodings you use. That character can be stored in files (or memory) in any encoding which supports š.
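
In Python 2, the built-ins unichr and ord expose that mapping between numbers and characters (a quick illustration):

>>> unichr(0x0161)          # code point 0x0161 -> the character
u'\u0161'
>>> ord(u'\u0161')          # and back to the number
353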

What is UTF-8?

When encoding a unicode character, one can use any encoding, but not all of them will support all characters.

For example, š (unicode 0x0161) can be encoded in iso-8859-2 as 0xB9, but it cannot be encoded in iso-8859-1 at all.

So, to be able to encode anything, you need an encoding which supports every unicode character. UTF-8 is one of those encodings, but there are others:

>>> u'\u0161'.encode('utf-7')
'+AWE-'
>>> u'\u0161'.encode('utf-8')
'\xc5\xa1'
>>> u'\u0161'.encode('utf-16le')
'a\x01'
>>> u'\u0161'.encode('utf-16be')
'\x01a'
>>> u'\u0161'.encode('utf-32le')
'a\x01\x00\x00'
>>> u'\u0161'.encode('utf-32be')
'\x00\x00\x01a'

The good thing about utf-8 is that the whole ASCII range is unchanged and as long as only ASCII is used, only one byte is used per character:

>>> u'abcdefg'.encode('utf-8')
'abcdefg'

Unicode in Python 2

Important: This is really specific to Python 2. Python 3 is different.

Unlike str objects, which are strings of bytes, unicode objects are strings of unicode characters.

They can be encoded into a str in a chosen encoding, or decoded from a str in a chosen encoding.

A unicode string is specified using u before the opening quote. The characters inside are interpreted using the current source (or console) encoding, or they can be specified in numeric form \uHEX:

>>> u'ABCD'
u'ABCD'
>>>
>>> u'\u0041\u0042\u0043'
u'ABC'
>>> u'šâů'
u'\u0161\xe2\u016f'
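
Encoding and decoding convert between the two types; for instance, with the Chinese text from the question (a small illustration added to this answer):

>>> u'\u4f60\u597d'.encode('utf-8')            # u'你好' as unicode -> UTF-8 bytes
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> '\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf-8')  # and back to unicode
u'\u4f60\u597d'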

And Now the Answers

First Question

  • content prints repr(content)
  • print content prints content itself (see the quick check below)
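
For example, with the line from the question, on a terminal whose encoding is UTF-8 (an assumption; the answerer's cp852 console above renders the same bytes differently):

>>> content = '\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>> content                 # the shell echoes repr(content)
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>> print content           # print writes the raw bytes; a UTF-8 terminal renders them
你好 中国 Hello China 1 2 3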

Second Question

UTF-8 strings are byte strings (str). You get them by encoding the unicode:

>>> u'\u0161'.encode('utf-8')
'\xc5\xa1'
>>> '\xc5\xa1'.decode('utf-8')
u'\u0161'

So yes, encode converts unicode to str. The str can be utf-8, but it does not have to be.

Third Question

A) "Why the chinese character turn into utf-8 code when I do .split()?"

They were utf-8 all the time.

B) "I thought fw.write('{0}'.format(content_list).decode('utf-8')) will work"

content_list is not a string. It is a list. When a list is converted to a string, it is done using its repr, which also does repr of all of the contents.

For example:

>>> 'a \n a \n a'
'a \n a \n a'
>>> print 'a \n a \n a'
a
 a
 a
>>> print ['a \n a \n a']
['a \n a \n a']

The last print printed repr(list) which contains repr(str).
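
The explanation above says why the repr ends up in out.TEST; to write the actual characters instead, one simple approach (a sketch, not part of the original answer) is to join the split pieces back into a single byte string before writing:

#!/usr/bin/python
#-*- coding: utf-8 -*-

fr = open('a.TEST')
fw = open('out.TEST', 'w')

content = fr.readline()
content_list = content.split()             # a list of UTF-8 byte strings
fw.write(' '.join(content_list) + '\n')    # write the bytes, not the list's repr

fr.close()
fw.close()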



Answer 2:

In the beginning there were only English characters, and people were not satisfied.

They wanted to display every character in the world, but there was a problem: a single byte can only represent 256 values, which is simply not enough room to hold them all.

So people decided to use more than one byte per character. UTF-8 is one such scheme; it uses between one and four bytes for each character.

No matter which characters you write, they are all stored in byte form.

In Python 2, str holds bytes and unicode holds text. Unicode itself is not an encoding of str; it is a mapping from characters to numbers, and an encoding such as UTF-8 turns those numbers into bytes.

'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd' is the byte form of "你好 中国". It cannot be displayed correctly unless you know which encoding produced it.
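
A quick round trip in the Python 2 shell shows how those bytes relate to the characters (an added illustration; the print line assumes a UTF-8 terminal):

>>> b = '\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd'
>>> b.decode('utf-8')                       # bytes -> unicode text
u'\u4f60\u597d \u4e2d\u56fd'
>>> print b.decode('utf-8')                 # the terminal renders the characters
你好 中国
>>> b.decode('utf-8').encode('utf-8') == b  # and back to the same bytes
True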

I suppose you could blame the terminal rather than Python: Python has no problem producing UTF-8 output, and whether cat (or anything else) shows it correctly depends on the terminal's encoding.