Convert different encodings to ASCII


Question:

I have a hundred files, and according to chardet each file is encoded in one of the following:

['UTF-8', 'ascii', 'ISO-8859-2', 'UTF-16LE', 'TIS-620', 'utf-8', 'SHIFT_JIS', 'ISO-8859-7']

So I know each file's encoding, and therefore which encoding to open it with.

I wish to convert all the files to ASCII only. I also wish to convert variant characters like - and ' to their plain ASCII equivalents. For example, b"\xe2\x80\x94".decode("utf8") should be converted to -. The most important thing is that the text stays easy to read: I don't want don t, for example, but rather don't.
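
For illustration (a quick Python 2 interpreter session, not part of the original question): the byte sequence above decodes to U+2014 EM DASH, which naive ASCII stripping simply drops rather than replaces:

>>> b"\xe2\x80\x94".decode("utf8")
u'\u2014'
>>> u"\u2014".encode("ascii", "ignore")  # "ignore" deletes the character entirely
''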

How might I do this?

I can use either Python 2 or 3 to solve this.

This is as far as I got with Python 2. To begin with, I'm trying to detect the lines that contain non-ASCII characters.

import os

import chardet

for file_name in os.listdir('.'):
    print(file_name)
    r = chardet.detect(open(file_name).read())
    charenc = r['encoding']
    with open(file_name, "r") as f:
        for line in f.readlines():
            if line.decode(charenc) != line.decode("ascii", "ignore"):
                print(line.decode("ascii", "ignore"))

This gives me the following exception:

    if line.decode(charenc) != line.decode("ascii","ignore"):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 6: truncated data

Answer 1:

Don't use .readlines() on a binary file whose encoding produces multi-byte newlines. In little-endian UTF-16 a newline is encoded as two bytes, 0A (a newline in ASCII) followed by 00 (a NUL). .readlines() splits on the first of those two bytes, leaving you with incomplete data to decode.
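
You can see this in a Python 2 interpreter (a minimal sketch illustrating the point, not from the original answer):

>>> data = u"hi\n".encode("utf-16-le")
>>> data
'h\x00i\x00\n\x00'
>>> first_line = data[:data.index("\n") + 1]  # what .readlines() yields first
>>> first_line.decode("utf-16-le")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 4: truncated data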

Reopen the file with the io library for ease of decoding:

import io
import os

import chardet

for file_name in os.listdir('.'):
    print(file_name)
    # chardet wants the raw bytes, so read the whole file in binary mode.
    r = chardet.detect(open(file_name, 'rb').read())
    charenc = r['encoding']
    # io.open decodes for you, yielding unicode lines with correct newlines.
    with io.open(file_name, "r", encoding=charenc) as f:
        for line in f:
            line = line.encode("ascii", "ignore")
            print line
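
In Python 3 you don't need the io module, because the built-in open() already decodes for you. A sketch of the equivalent loop (my adaptation, not part of the original answer):

import os

import chardet

for file_name in os.listdir('.'):
    print(file_name)
    # chardet wants raw bytes, so detect from a binary-mode read.
    with open(file_name, 'rb') as f:
        charenc = chardet.detect(f.read())['encoding']
    with open(file_name, 'r', encoding=charenc) as f:
        for line in f:
            # The encode/decode round-trip drops anything outside ASCII.
            print(line.encode('ascii', 'ignore').decode('ascii'), end='')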

To replace specific Unicode codepoints with ASCII-friendly characters, use a dictionary mapping codepoint to codepoint or to a unicode string, and call line.translate() before encoding:

charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # right double quotation mark
    # etc.
}

line = line.translate(charmap)

I used hexadecimal integer literals here to define the Unicode codepoints to map from. The value in the dictionary must be a unicode string, an integer (a codepoint), or None to delete that codepoint altogether.
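
Putting it together, a short Python 2 usage sketch (the sample text and the extra quote and apostrophe entries are my additions, not from the original answer):

# -*- coding: utf-8 -*-
charmap = {
    0x2014: u'-',    # em dash
    0x2019: u"'",    # right single quotation mark (apostrophe)
    0x201C: u'"',    # left double quotation mark
    0x201D: u'"',    # right double quotation mark
}

line = u"\u201cdon\u2019t\u201d \u2014 easy to read"
print line.translate(charmap).encode("ascii", "ignore")
# "don't" - easy to read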