Unicode (UTF-8) reading and writing to files in Py

2019-01-01 08:06发布

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).

# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)

("u'Capit\xe1n'", "'Capit\xc3\xa1n'")

print ss, ss8
print >> open('f1','w'), ss8

>>> file('f1').read()
'Capit\xc3\xa1n\n'

So I type in Capit\xc3\xa1n into my favorite editor, in file f2.

Then:

>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?

What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?

>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'

13条回答
孤独寂梦人
2楼-- · 2019-01-01 08:11
# -*- encoding: utf-8 -*-

# converting a unknown formatting file in utf-8

import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')

for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()
查看更多
梦该遗忘
3楼-- · 2019-01-01 08:11

You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.

import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
查看更多
公子世无双
4楼-- · 2019-01-01 08:12

Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')

[Edit on 2016-02-10 for requested clarification]

Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open

open(file, mode='r', buffering=-1, 
      encoding=None, errors=None, newline=None, 
      closefd=True, opener=None)

Encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as utf8 (which is also now the default encoding of everything done in Python.)

查看更多
永恒的永恒
5楼-- · 2019-01-01 08:12

I was trying to parse iCal using Python 2.7.9:

from icalendar import Calendar

But I was getting:

 Traceback (most recent call last):
 File "ical.py", line 92, in parse
    print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)

and it was fixed with just:

print "{}".format(e[attr].encode("utf-8"))

(Now it can print liké á böss.)

查看更多
柔情千种
6楼-- · 2019-01-01 08:15

The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.

How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.

查看更多
ら面具成の殇う
7楼-- · 2019-01-01 08:16

So, I've found a solution for what I'm looking for, which is:

print open('f2').read().decode('string-escape').decode("utf-8")

There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.

This allows for the sort of round trip that I was imagining.

查看更多
登录 后发表回答