Python 3 unicode to utf-8 on file

I am trying to parse a log file, but the file is always saved in "Unicode" encoding. This is the manual process I would like to automate:

1. Open the file in Notepad
2. Save as...
3. Change the encoding from Unicode to UTF-8
4. Run my Python program on it

So this is the process I would like to automate in Python 3.4. Basically, I just want to convert the file to UTF-8, or do something like open(filename,'r',encoding='utf-8'), although that exact line threw this error when I called read() on the file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing as UTF-8, so that I don't have to call str.encode (or something like that) every time I analyze a string.

Has anybody been through this and knows which method I should use and how to do it?

EDIT:

In the Python 3 REPL, I did

>>> f = open('file.txt','r')
>>> f
<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>

So now my program opens the file with open('file.txt','r',encoding='cp1252'). I am running a lot of regexes over this file, though, and they aren't matching anything (I think because it isn't UTF-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom

Tags: unicode encoding utf-8 python-3.4

1 Answer

倾城　Initia

Post #2 · 2019-07-24 12:47

What Notepad calls "Unicode" is UTF-16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following error when decoding the file as UTF-8:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
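
As a minimal sketch of what is going on, the snippet below builds a few bytes the way Notepad's "Unicode" mode would save them and decodes them three ways. The sample text "log" is made up for illustration. Decoding as cp1252 does not raise an error, but it leaves the BOM and a NUL character after every letter in the string, which is a likely reason the regexes in the question found nothing.

```python
import codecs
import re

# Hypothetical sample: the text "log" saved as Windows "Unicode"
# (BOM FF FE followed by little-endian UTF-16).
raw = codecs.BOM_UTF16_LE + "log".encode("utf-16-le")

# 1. Decoding as UTF-8 fails on the very first BOM byte (0xff).
error_message = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)

# 2. Decoding as cp1252 "works", but keeps the BOM as two characters
#    and leaves a NUL after each letter, so a plain pattern cannot match.
mangled = raw.decode("cp1252")
match = re.search("log", mangled)

# 3. Decoding as UTF-16 reads the BOM, picks the byte order, and drops it.
text = raw.decode("utf-16")

print(error_message)
print(repr(mangled), match)
print(repr(text))
```

Once the file is opened with encoding='utf16', the strings Python hands back are ordinary str objects, so regexes work on them directly and no further conversion is needed just to analyze the text.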

To convert to UTF-8, you could use:

with open('log.txt',encoding='utf16') as f:
    data = f.read()
with open('utf8.txt','w',encoding='utf8') as f:
    f.write(data)

Note that many Windows editors expect a UTF-8 signature (BOM) at the beginning of the file, and may otherwise assume ANSI. "ANSI" really means the code page of the local language locale: on US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
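
For illustration, here is one way the conversion could be wrapped in a function that optionally writes the signature. The name convert_utf16_to_utf8 and its with_signature parameter are made up for this sketch, as are the sample file contents. It copies line by line instead of reading the whole file at once, and passes newline="" so that the original line endings are carried over unchanged; both are choices of this sketch rather than requirements.

```python
import codecs
import os
import tempfile


def convert_utf16_to_utf8(src_path, dst_path, with_signature=False):
    """Hypothetical helper: copy a UTF-16 text file to UTF-8.

    with_signature=True writes with 'utf-8-sig', which puts the
    EF BB BF signature at the start of the output file.
    """
    out_encoding = "utf-8-sig" if with_signature else "utf-8"
    # newline="" disables newline translation on both sides.
    with open(src_path, encoding="utf-16", newline="") as fin, \
         open(dst_path, "w", encoding=out_encoding, newline="") as fout:
        for line in fin:
            fout.write(line)


# Small demonstration with a throwaway file (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "log.txt")
    utf8_path = os.path.join(tmp, "utf8.txt")

    # Writing with 'utf-16' emits a BOM, like Notepad's "Unicode" option.
    with open(log_path, "w", encoding="utf-16", newline="") as f:
        f.write("error 42\r\n")

    convert_utf16_to_utf8(log_path, utf8_path, with_signature=True)

    with open(utf8_path, "rb") as f:
        out_bytes = f.read()

print(out_bytes.startswith(codecs.BOM_UTF8))
```

When reading such a file back in Python, encoding='utf-8-sig' skips the signature if it is present and also accepts files without one.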
