python translate bytecode to utf-8 using a variabl

2019-08-26 17:12发布

I have the following problem:

From a SQL Server database I am reading data using python module pypyodbc and ODBC Driver 13 for SQL Server and writing to txt files.

Database contains all kinds of special characters and they read as:

'PR\xc3\x86KVAL'

The '\xc3\x86' part is bytecode and should be interpreted that way. The other characters should be interpreted as shown. UTF8 would translate '\xc3\x86' to Æ.

If I type the value in b'PR\xc3\x86KVAL' , python recognizes it as bytecode and I can translate it to PRÆKVAL. See below:

s = b'PR\xc3\x86KVAL'
print(s)
bb = s.decode('utf-8')
print(bb)

The problem is that I don’t know how I can turn 'PR\xc3\x86KVAL’ to be recognized as a bytecode object.

I want the value that has to be decoded to be a variable so that all data from database can flow through it.

I Also tried ast.literal_eval(r”b'PR\xc3\x86KVAL'”), but variables won’t work in this way.

1条回答
Viruses.
2楼-- · 2019-08-26 17:25

Since you start out with PR\xc3\x86KVAL as a text string and decode indeed expects a raw byte sequence, you need to convert the text string into a bytes object. But when converting from one "encoding" value to another, Python needs to know what encoding it is starting with!

The easiest way to do so is explicitly encoding the string, using an encoding that does not change the special characters. You must be careful, because it is very well possible that a character code might be translated to something else, destroying their meaning.

You can see that with a simple example: attempting to tell Python this should be plain ASCII fails, for an obvious reason.

>>> s = 'PR\xc3\x86KVAL'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

Even though there are more than 1,000 questions on Stack Overflow about this, the reason for the failure should be easy to understand. All an encoder/decoder pair does is translate each character from 'source' to 'destination'. This can only work if the character in question actually exists in both the 'source' and 'destination' encodings. Suppose you want to translate a Greek character β to a Russian б, then the source must be able to decode the Greek character (because that is what you entered it in) and the destination must be able to encode the Russian character.

So you must be careful to choose an encoding which does not change the character \x86 in your input string into Ж (which it would do when using cp866, for example).

Fortunately, as quoted from https://stackoverflow.com/a/2617930/2564301, there is an encoding that does not mess up things:

Pass data.decode('latin1') to the codec. latin1 maps bytes 0-255 to Unicode characters 0-255, which is kinda elegant.

and so this should work:

>>> s = 'PR\xc3\x86KVAL'.encode('latin1')
>>> print(s)
b'PR\xc3\x86KVAL'

Now s is a properly encoded byte object, so you can decode it at will:

>>> bb = s.decode('utf-8')
>>> print(bb)
PRÆKVAL

Done!

查看更多
登录 后发表回答