UnicodeDecodeError: 'ascii' codec can'

2019-02-15 13:41发布

问题:

I am trying to write data in a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function.

First when I did this, the copy_from() was throwing an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92 So I followed this question.

I figured out that my Postgres database has UTF8 encoding.

The file/StringIO object I am writing my data into shows its encoding as the following: setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators

I tried to encode every string that I am writing to the intermediate file/StringIO object into UTF8 format. To do this used .encode(encoding='UTF-8',errors='strict')) for every string.

This is the error I got now: UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

What does it mean? How do I fix it?

EDIT: I am using Python 2.7 Some pieces of my code:

I read from a MySQL database that has data encoded in UTF-8 as per MySQL Workbench. This is a few lines code for writing my data (that's obtained from MySQL db) to StringIO object:

# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num=0
for row in cursor.fetchall() :

    # Separate rows in a table by new line delimiter
    if(row_num!=0):
        table_data.write("\n")

    col_num=0
    for cell in row:    
        # Separate cells in a row by tab delimiter
        if(col_num!=0):
            table_data.write("\t") 

        table_data.write(cell.encode(encoding='UTF-8',errors='strict'))
        col_num = col_num+1

    row_num = row_num+1   

This is the code that writes to Postgres database from my StringIO object table_data:

cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)

回答1:

The problem is that you're calling encode on a str object.

A str is a byte string, usually representing text encoded in some way like UTF-8. When you call encode on that, it first has to be decoded back to text, so the text can be re-encoded. By default, Python does that by calling s.decode(sys.getgetdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.

So, you're talking UTF-8 encoded text, decoding it as if it were ASCII, then re-encoding it in UTF-8.

The general solution is to explicitly call decode with the right encoding, instead of letting Python use the default, and then encode the result.

But when the right encoding is already the one you want, the easier solution is to just skip the .decode('utf-8').encode('utf-8') and just use the UTF-8 str as the UTF-8 str that it already is.

Or, alternatively, if your MySQL wrapper has a feature to let you specify an encoding and get back unicode values for CHAR/VARCHAR/TEXT columns instead of str values (e.g., in MySQLdb, you pass use_unicode=True to the connect call, or charset='UTF-8' if your database is too old to auto-detect it), just do that. Then you'll have unicode objects, and you can call .encode('utf-8') on them.

In general, the best way to deal with Unicode problems is the last one—decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python is going to implicitly convert for you, using your default encoding, which is almost never what you want.

As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, e.g., in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes will give you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and gives a cryptic message about an encode or decode you didn't ask for in others.