Python - Decode UTF-16 file with BOM

Posted 2020-02-23 08:51

I have a UTF-16 LE file with a BOM. I'd like to convert it to UTF-8 without a BOM so I can parse it with Python.

The code I usually use didn't do the trick: it returned unknown characters instead of the actual file contents.

f = open('dbo.chrRaces.Table.sql').read()
f = str(f).decode('utf-16le', errors='ignore').encode('utf8')
print f

What would be the proper way to decode this file so I can parse through it with f.readlines()?

Tags: python file encoding utf-8 utf-16

1 answer

啃猪蹄的小仙女

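#2 · 2020-02-23 09:18

First, read the file in binary mode. Otherwise things get confusing, because text mode can alter the raw bytes (for example, newline translation on Windows) before you have a chance to decode them.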

Then, check for and remove the BOM, since it is part of the file, but not part of the actual text.

import codecs

# read in binary mode so the BOM bytes arrive untouched
with open('dbo.chrRaces.Table.sql', 'rb') as sql_file:
    encoded_text = sql_file.read()
bom = codecs.BOM_UTF16_LE                       # see dir(codecs) for the other BOM constants
assert encoded_text.startswith(bom)             # make sure the encoding is what you expect, otherwise you'll get wrong data
encoded_text = encoded_text[len(bom):]          # strip away the BOM
decoded_text = encoded_text.decode('utf-16le')  # decode to unicode
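As an alternative to stripping the BOM by hand, here is a minimal sketch that assumes Python 3 (the code above is written in Python 2 style). The generic 'utf-16' codec, without an LE/BE suffix, reads the BOM, picks the byte order from it and drops it from the decoded text. The helper name read_utf16_lines and the sample file contents are made up for illustration.

```python
import codecs
import os
import tempfile


def read_utf16_lines(path):
    """Return the lines of a UTF-16 file that starts with a BOM."""
    # 'utf-16' consumes the BOM and uses it to choose the byte order,
    # so the first line does not start with a stray U+FEFF character.
    with open(path, encoding='utf-16') as handle:
        return handle.readlines()


# Demo on a throwaway file: UTF-16 LE BOM followed by two CRLF-terminated lines.
sample = codecs.BOM_UTF16_LE + 'CREATE TABLE chrRaces;\r\nGO\r\n'.encode('utf-16-le')
with tempfile.NamedTemporaryFile(suffix='.sql', delete=False) as tmp:
    tmp.write(sample)

lines = read_utf16_lines(tmp.name)
os.remove(tmp.name)
print(lines)
```

Unlike the explicit assert above, this version does not check that the file is little-endian. It accepts either byte order as long as a BOM is present, which may or may not be what you want.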

Don't encode (to utf-8 or otherwise) until you're done with all parsing/processing. You should do all that using unicode strings.
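A small sketch of that order of operations, assuming Python 3 and using made-up in-memory data instead of the real file: decode once at the start, do all the processing on text, and encode to UTF-8 only at the very end. The plain 'utf-8' codec does not write a BOM.

```python
import codecs

# Hypothetical input: UTF-16 LE bytes with a BOM, as they would come from the file.
raw = codecs.BOM_UTF16_LE + 'Orc\r\nHuman\r\n'.encode('utf-16-le')

# 1. Decode once. The 'utf-16' codec consumes the BOM.
text = raw.decode('utf-16')

# 2. Do all parsing and processing on the decoded text.
names = [line.strip().upper() for line in text.splitlines()]

# 3. Encode only when the result leaves the program.
utf8_bytes = '\n'.join(names).encode('utf-8')
print(utf8_bytes)
```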

Also, errors='ignore' on decode may be a bad idea. Consider what's worse: having your program tell you something is wrong and stop, or returning wrong data?
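To make the trade-off concrete, here is a short sketch (assuming Python 3, with deliberately damaged sample bytes) that compares the default strict error handling with errors='ignore':

```python
# 'hi' in UTF-16 LE is four bytes; dropping the last one leaves a broken code unit.
damaged = 'hi'.encode('utf-16-le')[:-1]

# Strict decoding (the default) refuses the damaged input.
try:
    damaged.decode('utf-16-le')
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print('strict decode refused the data:', exc)

# errors='ignore' silently drops what it cannot decode and returns partial text.
silently_wrong = damaged.decode('utf-16-le', errors='ignore')
print(repr(silently_wrong))
```

The strict call raises an exception, while the 'ignore' call quietly returns a shorter string with no sign that anything was lost.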
