I have a bunch of text files containing Korean characters with the wrong encoding. Specifically, the characters appear to be encoded as EUC-KR, but the files themselves were saved as UTF-8+BOM.
So far I managed to fix a file with the following:
- Open a file with EditPlus (it shows the file's encoding is UTF8+BOM)
- In EditPlus, save the file as ANSI
- Lastly, in Python:
    import codecs

    # read the intermediate ANSI file as EUC-KR text
    with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
        contents = source_file.read()

    # write it back out as UTF-8
    with open(html, 'w+b') as dest_file:
        dest_file.write(contents.encode('utf-8'))
I want to automate this, but I have not been able to do so. I can open the original file in Python:

    codecs.open(html, 'rb', encoding='utf-8-sig')

However, I haven't been able to figure out how to automate the second step (re-saving as ANSI).
I am presuming here that you have text already encoded to EUC-KR, then encoded again to UTF-8. If so, encoding to Latin 1 (what Windows calls ANSI) is indeed the best way to get back to the original EUC-KR bytestring.
Open the file as UTF8 with BOM, encode to Latin1, decode as EUC-KR:
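A minimal sketch of that round trip (assuming, as in the question, that html holds the path to one file) could look like this:

    import io

    # 'utf-8-sig' strips the BOM while decoding the file
    with io.open(html, encoding='utf-8-sig') as source_file:
        contents = source_file.read()

    # encoding to Latin-1 recovers the original EUC-KR bytes,
    # which are then decoded as EUC-KR to get the real text
    fixed = contents.encode('latin1').decode('euc-kr')

    # write the repaired text back out as plain UTF-8
    with io.open(html, 'w', encoding='utf-8') as dest_file:
        dest_file.write(fixed)

If the decoded text contains anything outside the Latin-1 range, the .encode('latin1') step will raise a UnicodeEncodeError, which is a useful signal that the file was not double-encoded after all.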
I'm using the io.open() function here instead of codecs as the more robust method; io is the new Python 3 I/O library, also backported to Python 2.

Demo:
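For instance, a quick round trip on a short Korean sample string (my own illustration; the exact repr output differs between Python 2 and 3):

    # simulate the damage: EUC-KR bytes misread as Latin-1 text
    garbled = u'한국어'.encode('euc-kr').decode('latin1')
    print(repr(garbled))    # e.g. 'ÇÑ±¹¾î' (mojibake)

    # undo it: back to the original bytes via Latin-1, then decode as EUC-KR
    print(garbled.encode('latin1').decode('euc-kr'))    # 한국어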