Python cannot open UTF-8 encoded text file

I have a .py script containing the following code, which opens a specific text file (generated by Exchange PowerShell):

import codecs

with codecs.open("C:\\Temp\\myfile.txt", encoding="utf_8", mode="r", errors="replace") as myfile:
    content = myfile.readlines()  # read all lines into a list
    print(content)

However, I also tried utf-16-be and utf-16-le (and standard ASCII, obviously), but the output still looks like this (this is just part of it):

['��\r', '\x00\n', '\x00D\x00o\x00m\x00a\x00i\x00n\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r', '\x00\n', '\x00-\x00-\x00-\x00-\x00-\x00-\x00

The file I am trying to open is located here.

Does anybody know what I am doing wrong? Is this some different kind of encoding?

Tags: python powershell utf-8

2 answers

傲

Reply #2 · 2020-03-31 07:42

You are getting garbled output because you are trying to decode a file encoded in UTF-16 as UTF-8.

UTF-16 allows a byte order mark (BOM), the code point U+FEFF, to precede the first actual coded value. Its appearance as a magic number at the start of a text stream can signal several things to a program consuming the text:

What byte order, or endianness, the text stream is stored in;
The fact that the text stream is Unicode, to a high level of confidence;
Which of several Unicode encodings that text stream is encoded as.

BOM use is optional and, if used, it should appear at the start of the text stream.

If you open the file in "rb" mode, i.e. with the intention of reading it as a byte stream, this should be the first line of the output:

b'\xff\xfe\r\x00\n'

The first two bytes, \xff\xfe, are the BOM I was talking about. In that order they indicate little-endian UTF-16.
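As a rough illustration of how a program can use those leading bytes, here is a minimal sketch that checks for a UTF-16 BOM. The helper name sniff_utf16_bom is made up for this example; the BOM constants come from the standard codecs module. It deliberately ignores UTF-32, whose little-endian BOM also begins with \xff\xfe.

```python
import codecs

def sniff_utf16_bom(raw):
    """Return a codec name based on a leading UTF-16 BOM, or None if absent.

    Hypothetical helper for illustration; it does not handle UTF-32 or UTF-8 BOMs.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    return None

first_line = b"\xff\xfe\r\x00\n"  # the first line shown above
print(sniff_utf16_bom(first_line))  # utf-16-le
```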

If you run the following code:

with open("myfile.txt", "r", encoding="utf-16") as file:
    for line in file:  # "utf-16" reads the BOM and picks the byte order
        print(line)

your output will have no errors.

If you need UTF-8 for some specific reason, removing the first line (b'\xff\xfe\r\x00\n') from the input file is not enough, because the remaining bytes are still UTF-16. Decode the file as UTF-16 and write the text back out as UTF-8 instead.
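One way to end up with a UTF-8 copy is to decode the file as UTF-16 and write the text back out as UTF-8. The sketch below assumes the file really is UTF-16 with a BOM; the function name utf16_to_utf8 and the temporary file names are made up for the demo, which builds its own small sample file rather than using the original one.

```python
import os
import tempfile

def utf16_to_utf8(src_path, dst_path):
    """Re-encode a BOM-prefixed UTF-16 text file as UTF-8 (hypothetical helper)."""
    # encoding="utf-16" uses the BOM to pick the byte order and drops it;
    # newline="" keeps the original \r\n line endings untouched.
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

# Demo on a small sample file standing in for the real one.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "sample_utf16.txt")
dst = os.path.join(tmp_dir, "sample_utf8.txt")
with open(src, "wb") as f:
    f.write("\r\nDomain\r\n".encode("utf-16"))  # encode() prepends a BOM

utf16_to_utf8(src, dst)
with open(dst, "rb") as f:
    converted = f.read()
print(converted)  # b'\r\nDomain\r\n'
```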

For more, refer to:

BOM

UTF-16


别忘想泡老子

Reply #3 · 2020-03-31 07:46

First, this text is definitely not UTF-8, so that's why Python can't open it as a UTF-8-encoded text file.

Second, you claim you "tried also utf-16-be and utf-16-le", but didn't show how you did that, and I suspect you did it wrong.

From the output, this is very likely UTF-16-LE with a BOM.

Because of the way you've printed them, we can't tell exactly which bytes the first two are, but this is what it looks like when you print \xFF and \xFE bytes. The rest of the strings are a bunch of NUL bytes alternating with reasonable-looking bytes, which almost always means UTF-16-LE. Plus, the most common two-byte encoding with a BOM in the wild is UTF-16-LE, and the fact that you're using all Microsoft tools makes that even more likely.
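To see why the output looks this way, the following sketch builds a few bytes of BOM-prefixed UTF-16-LE by hand (a stand-in for the start of the real file, which is an assumption here) and decodes them as UTF-8 with errors="replace", as the question's code does.

```python
import codecs

# Assumed sample: BOM, a CRLF, then the word "Domain", all in UTF-16-LE.
raw = codecs.BOM_UTF16_LE + "\r\nDomain".encode("utf-16-le")

# \xff and \xfe are never valid in UTF-8, so each becomes U+FFFD;
# the NUL bytes are valid UTF-8 and survive as \x00 characters.
wrong = raw.decode("utf-8", errors="replace")
lines = wrong.splitlines(True)  # keep line endings, similar to readlines()
print(lines)
```

The resulting list has the same shape as the output in the question: two replacement characters followed by \r, then a \x00\n entry, then text interleaved with \x00.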

So, if you'd really tried utf-16-le, you would almost certainly have gotten the right string, but with an extra \ufeff at the start.

But of course the right answer is to just decode it as 'utf-16', which will consume and use the BOM properly.
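The difference between the two codecs can be checked on a small hand-built sample (the bytes below are an assumption standing in for the real file):

```python
import codecs

raw = codecs.BOM_UTF16_LE + "Domain".encode("utf-16-le")

as_le = raw.decode("utf-16-le")  # BOM is kept as an ordinary character
as_auto = raw.decode("utf-16")   # BOM is consumed to choose the byte order

print(repr(as_le))    # '\ufeffDomain'
print(repr(as_auto))  # 'Domain'
```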

