Python 3 unicode to utf-8 on file

2019-07-24 13:00发布

问题:

I am trying to parse through a log file, but the file format is always in unicode. My usual process that I would like to automate:

  • I pull file up in notepad
  • Save as...
  • change encoding from unicode to UTF-8
  • Then run python program on it

So this is the process I would like to automate in Python 3.4. Pretty much just changed the file to UTF-8 or something like open(filename,'r',encoding='utf-8') although this exact line was throwing me this error when I tried to call read() on it:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing in UTF-8 that way I don't have to str.encode (or something like that) every time I analyze a string.

Anybody been through this and know which method I should use and how to do it?

EDIT:

In the python3 repr, I did

>>> f = open('file.txt','r')
>>> f
(_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252')

So now my python code in my program opens the file with open('file.txt','r',encoding='cp1252'). I am running a lot of regex looking through this file though and it isn't picking it up (I think because it isn't utf-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom

回答1:

What notepad considers Unicode is utf16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following when using utf8 to decode the file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

To convert to UTF-8, you could use:

with open('log.txt',encoding='utf16') as f:
    data = f.read()
with open('utf8.txt','w',encoding='utf8') as f:
    f.write(data)

Note that many Windows editors like a UTF-8 signature at the beginning of the file, or may assume ANSI instead. ANSI is really the local language locale. On US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.