Python: how to convert from Windows 1251 to Unicod

2020-02-01 03:59发布

问题:

I'm trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.

#!/usr/bin/env python

import os
import sys
import shutil

def convert_to_utf8(filename):
# gather the encodings you think that the file may be
# encoded inside a tuple
encodings = ('windows-1253', 'iso-8859-7', 'macgreek')

# try to open the file and exit if some IOError occurs
try:
    f = open(filename, 'r').read()
except Exception:
    sys.exit(1)

# now start iterating in our encodings tuple and try to
# decode the file
for enc in encodings:
    try:
        # try to decode the file with the first encoding
        # from the tuple.
        # if it succeeds then it will reach break, so we
        # will be out of the loop (something we want on
        # success).
        # the data variable will hold our decoded text
        data = f.decode(enc)
        break
    except Exception:
        # if the first encoding fail, then with the continue
        # keyword will start again with the second encoding
        # from the tuple an so on.... until it succeeds.
        # if for some reason it reaches the last encoding of
        # our tuple without success, then exit the program.
        if enc == encodings[-1]:
            sys.exit(1)
        continue

# now get the absolute path of our filename and append .bak
# to the end of it (for our backup file)
fpath = os.path.abspath(filename)
newfilename = fpath + '.bak'
# and make our backup file with shutil
shutil.copy(filename, newfilename)

# and at last convert it to utf-8
f = open(filename, 'w')
try:
    f.write(data.encode('utf-8'))
except Exception, e:
    print e
finally:
    f.close()

How can I do that?

Thank you

回答1:

import codecs

f = codecs.open(filename, 'r', 'cp1251')
u = f.read()   # now the contents have been transformed to a Unicode string
out = codecs.open(output, 'w', 'utf-8')
out.write(u)   # and now the contents have been output as UTF-8

Is this what you intend to do?



回答2:

This is just a guess, since you didn't specify what you mean by "doesn't work".

If the file is being generated properly but appears to contain garbage characters, likely the application you're viewing it with does not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file - the 3 bytes 0xEF,0xBB,0xBF (unencoded).



回答3:

If you use the codecs module to open the file, it will do the conversion to Unicode for you when you read from the file. E.g.:

import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)

This only makes sense if you're working with the file's data in Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you'll have to specify an actual encoding, since you can't write a file in "Unicode".