Python: Why am I getting a UnicodeDecodeError?

Posted 2020-07-31 11:46

Question:

I have the following code that searches through files using regular expressions and, if any matches are found, moves the file into a different directory.

import os
import gzip
import re
import shutil

def regEx1():
    os.chdir("C:/Users/David/myfiles")
    files = os.listdir(".")
    os.mkdir("C:/Users/David/NewFiles")
    regex_txt = input("Please enter the string your are looking for:")
    for x in (files):
        inputFile = open((x), "r")
        content = inputFile.read()
        inputFile.close()
        regex = re.compile(regex_txt, re.IGNORECASE)
        if re.search(regex, content)is not None:
            shutil.copy(x, "C:/Users/David/NewFiles")

When I run it, I get the following error message:

Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Python33\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 367: character maps to <undefined>

Could someone please explain why this message appears?

Answer 1:

In Python 3, when you open a file for reading in text mode ("r"), Python decodes the file's bytes to Unicode text (str).

Since you didn't specify which encoding to use, the platform default (from locale.getpreferredencoding()) is used. Here that is cp1252, as the traceback shows, and byte 0x9d has no character assigned in that code page, so decoding fails.
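
The failure can be reproduced without any file. This is a minimal sketch; it assumes the offending byte came from a UTF-8 encoded curly quote, which is one common source of 0x9d but is only a guess about the asker's files.

```python
import locale

# The encoding open() typically falls back to in text mode when none is given.
print(locale.getpreferredencoding(False))

# Assumption for illustration: a right double quotation mark (U+201D).
# Encoded as UTF-8 it becomes the three bytes e2 80 9d.
data = "\u201d".encode("utf-8")

try:
    data.decode("cp1252")      # 0x9d is undefined in Python's cp1252 table
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(exc)

# Decoding the same bytes with the encoding they were written in works.
utf8_text = data.decode("utf-8")
```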

You need to either specify an encoding that can decode the file contents, or open the file in binary mode instead (and use b'' bytes patterns for your regular expressions).
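
Both options can be sketched as follows. The sample file and its contents are made up for the demonstration, and UTF-8 is an assumption; use whatever encoding your files were actually written in.

```python
import os
import re
import tempfile

# Hypothetical sample file: UTF-8 text containing curly quotes.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("He said \u201cHello\u201d".encode("utf-8"))
    sample_path = tmp.name

# Option 1: text mode with an explicit encoding and a str pattern.
with open(sample_path, "r", encoding="utf-8") as f:
    text_match = re.search("hello", f.read(), re.IGNORECASE)

# Option 2: binary mode with a bytes pattern; nothing gets decoded,
# so no UnicodeDecodeError can occur.
with open(sample_path, "rb") as f:
    bytes_match = re.search(b"hello", f.read(), re.IGNORECASE)

os.remove(sample_path)
```

Note that with bytes patterns, re.IGNORECASE only folds case for ASCII letters, so option 1 is the better fit when the search string itself contains non-ASCII characters.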

See the Python Unicode HOWTO for more information.



Answer 2:

I'm not too familiar with Python 3.x, but the following may work.

inputFile = open(x, "r", encoding="utf8")


Answer 3:

There's a similar question here: Python: Traceback codecs.charmap_decode(input,self.errors,decoding_table)[0]

But you might want to try:

 open(x, "r", encoding="utf-8")
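
Applied to the loop from the question, the fix might look like the sketch below. The function name copy_matching_files and its parameters are hypothetical, UTF-8 is an assumed default, and errors="replace" is an optional extra that keeps a single undecodable byte from aborting the whole run.

```python
import os
import re
import shutil

def copy_matching_files(src_dir, dest_dir, pattern, encoding="utf-8"):
    """Copy files from src_dir whose text matches pattern into dest_dir."""
    regex = re.compile(pattern, re.IGNORECASE)  # compile once, outside the loop
    os.makedirs(dest_dir, exist_ok=True)        # no error if it already exists
    copied = []
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        # errors="replace" turns undecodable bytes into U+FFFD instead of raising
        with open(path, "r", encoding=encoding, errors="replace") as f:
            content = f.read()
        if regex.search(content):
            shutil.copy(path, dest_dir)
            copied.append(name)
    return copied
```

A call such as copy_matching_files("C:/Users/David/myfiles", "C:/Users/David/NewFiles", regex_txt) would then replace the body of regEx1().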


Answer 4:

Thank you very much for this solution. It helped me with a different problem. I was using:

exec(open("DIP6.py").read())

and I got this error because of this symbol in a comment in DIP6.py:

 #       ● en première colonne

It works fine with:

exec(open("DIP6.py", encoding="utf8").read())

It also solves a problem with, for example,

print("été")

in DIP6.py.

I got:

été

in the console.

Thank you :-)