I am trying to parse through a log file, but the file format is always in unicode. My usual process that I would like to automate:
- I pull file up in notepad
- Save as...
- Change encoding from Unicode to UTF-8
- Then run python program on it
So this is the process I would like to automate in Python 3.4: pretty much just change the file to UTF-8, or use something like open(filename,'r',encoding='utf-8'), although this exact line threw the following error when I tried to call read() on it:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing as UTF-8, so that I don't have to str.encode (or something like that) every time I analyze a string.
Anybody been through this and know which method I should use and how to do it?
EDIT:
In the Python 3 REPL, I did
>>> f = open('file.txt','r')
>>> f
<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>
So now my Python code in my program opens the file with open('file.txt','r',encoding='cp1252'). I am running a lot of regex looking through this file though, and it isn't picking anything up (I think because it isn't UTF-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom
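What Notepad considers Unicode is utf16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why decoding the file as utf8 fails with the UnicodeDecodeError shown in the question: 0xff, the first byte of the BOM, is not a valid UTF-8 start byte.
To convert to UTF-8, read the file with encoding='utf16' and write the text back out with encoding='utf8', for example to a new file named utf8.txt.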
Note that many Windows editors like a UTF-8 signature at the beginning of the file, or may assume ANSI instead. ANSI is really the local language locale. On US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
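As a rough illustration of the utf-8-sig variant, the sketch below writes text with the signature that those editors look for. The helper name write_utf8_with_signature is hypothetical.

```python
def write_utf8_with_signature(text, dst_path):
    """Write text as UTF-8 preceded by the signature bytes EF BB BF."""
    # The 'utf-8-sig' codec emits the signature once, at the start of
    # the file, then encodes the rest as ordinary UTF-8.
    with open(dst_path, 'w', encoding='utf-8-sig') as dst:
        dst.write(text)
```

Reading such a file back in Python with encoding='utf-8-sig' strips the signature again, whereas plain encoding='utf8' would leave it in the text as a leading '\ufeff' character.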