I have a hundred files and according to chardet each file is encoded with one of the following:
['UTF-8', 'ascii', 'ISO-8859-2', 'UTF-16LE', 'TIS-620', 'utf-8', 'SHIFT_JIS', 'ISO-8859-7']
So I know each file's encoding, and therefore I know which encoding to open it with.
I wish to convert all files to ASCII only. I also wish to convert the different variants of characters, such as em dashes and curly apostrophes, to their plain ASCII equivalents. For example, b"\xe2\x80\x94".decode("utf8") (an em dash) should be converted to a plain -. The most important thing is that the text remains easy to read: I don't want "don t", for example, but rather "don't".
How might I do this?
I can use either Python 2 or 3 to solve this.
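Here is roughly the direction I've been considering in Python 3: decode each file with the encoding chardet reports, run the text through a small hand-written replacement table for the punctuation variants, and drop anything that still isn't ASCII. This is only a sketch, and the replacement table is my own guess at the characters involved, not a complete list:

import os
import chardet

# Hand-picked replacements -- extend as more characters turn up.
REPLACEMENTS = {
    "\u2014": "-",    # em dash
    "\u2013": "-",    # en dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}
TABLE = str.maketrans(REPLACEMENTS)

for file_name in os.listdir('.'):
    with open(file_name, 'rb') as f:
        raw = f.read()
    encoding = chardet.detect(raw)['encoding'] or 'ascii'
    text = raw.decode(encoding)
    # Swap known punctuation variants for their ASCII equivalents
    text = text.translate(TABLE)
    # Anything still outside ASCII gets dropped rather than crashing
    ascii_text = text.encode('ascii', 'ignore').decode('ascii')
    with open(file_name, 'w', encoding='ascii') as f:
        f.write(ascii_text)

The translate step only handles the characters listed explicitly, which is why the encode/decode with "ignore" follows as a fallback for anything I missed.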
This is as far as I got with Python 2. To begin with, I'm trying to detect the lines which contain non-ASCII characters.
for file_name in os.listdir('.'):
    print(file_name)
    # Detect the file's encoding from its raw contents
    r = chardet.detect(open(file_name).read())
    charenc = r['encoding']
    with open(file_name, "r") as f:
        for line in f.readlines():
            # A line that differs from its ASCII-stripped version contains non-ASCII characters
            if line.decode(charenc) != line.decode("ascii", "ignore"):
                print(line.decode("ascii", "ignore"))
This gives me the following exception:
if line.decode(charenc) != line.decode("ascii","ignore"):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 6: truncated data
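I suspect the problem is that readlines() splits the raw bytes on \n, which in UTF-16LE falls in the middle of a character (the newline there is the two bytes 0a 00), so each line looks truncated to the codec. Reading the whole file as bytes and decoding it in one go, as in the sketch below, seems to avoid that, although I'm not sure it's the right way to go about it:

import os
import chardet

for file_name in os.listdir('.'):
    print(file_name)
    with open(file_name, 'rb') as f:   # read raw bytes instead of text-mode lines
        raw = f.read()
    charenc = chardet.detect(raw)['encoding']
    # Decode the entire file at once, then split into lines afterwards
    for line in raw.decode(charenc).splitlines():
        stripped = line.encode('ascii', 'ignore').decode('ascii')
        if line != stripped:
            print(stripped)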