Python 3: how do I encode this string correctly?

Posted 2019-08-23 11:04

Question:

Disclaimer: I have already done a lot of research to solve this on my own, but most of the questions I found here concern Python 2.7 or don't solve my problem.

Let's say I have the following (this example comes from the BeautifulSoup docs; I'm trying to solve a bigger issue):

>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(markup)
<h1>SacrÃ© bleu!</h1>

To me, markup should be a bytes object, so that I could do:

>>> markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(str(markup, 'utf-8'))
<h1>Sacré bleu!</h1>

Yeah! But how do I get from the str "<h1>Sacr\xc3\xa9 bleu!</h1>", which is wrong, to the bytes b"<h1>Sacr\xc3\xa9 bleu!</h1>"?

Because if I do:

>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> bytes(markup, "utf-8")
b'<h1>Sacr\xc3\x83\xc2\xa9 bleu!</h1>'

You see? It inserted \x83\xc2 for free.

>>> print(bytes(markup))
TypeError: string argument without an encoding

Answer 1:

If you have the Unicode string "<h1>Sacr\xc3\xa9 bleu!</h1>", something has already gone wrong. Either your input is broken, or you did something wrong when processing it. For example, here, you've copied a Python 2 example into a Python 3 interpreter.
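To make the difference concrete, here is a small sketch of what the two literals contain in Python 3. The names `text` and `data` are only illustrative. In a str literal, `\xc3` and `\xa9` are two separate code points (Ã and ©); in a bytes literal, the same escapes are the two bytes of the UTF-8 encoding of é.

```python
# The same escapes mean different things in str and bytes literals.
text = "<h1>Sacr\xc3\xa9 bleu!</h1>"   # str: U+00C3 followed by U+00A9
data = b"<h1>Sacr\xc3\xa9 bleu!</h1>"  # bytes: 0xC3 0xA9, the UTF-8 form of é

# Inspect the two suspicious characters in the str.
code_points = [hex(ord(c)) for c in text[8:10]]
print(code_points)

# Decoding the bytes as UTF-8 gives the text that was intended.
decoded = data.decode("utf-8")
print(decoded)
```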

If you have your broken string because you did something wrong to get it, then you should really fix whatever it was you did wrong. If you need to convert "<h1>Sacr\xc3\xa9 bleu!</h1>" to b"<h1>Sacr\xc3\xa9 bleu!</h1>" anyway, then encode it in latin-1:

bytestring = broken_unicode.encode('latin1')
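As a rough sketch of the full round trip, assuming `broken_unicode` holds the string from the question (`fixed` and `double_encoded` are names made up for this example): latin-1 maps code points 0 to 255 directly onto the byte with the same value, so encoding with it recovers the original bytes, which can then be decoded as UTF-8. Encoding with UTF-8 instead turns each of the two wrong characters into two bytes, which is where the extra \x83\xc2 comes from.

```python
broken_unicode = "<h1>Sacr\xc3\xa9 bleu!</h1>"

# latin-1 maps each code point below 256 to the byte with the same value,
# so this recovers the bytes that were originally meant.
bytestring = broken_unicode.encode("latin1")

# Those bytes are valid UTF-8, so they can now be decoded properly.
fixed = bytestring.decode("utf-8")
print(fixed)

# For comparison: UTF-8 encodes U+00C3 as c3 83 and U+00A9 as c2 a9,
# which produces the doubled bytes seen in the question.
double_encoded = broken_unicode.encode("utf-8")
print(double_encoded)
```

Note that this only works if every character in the broken string is below U+0100; otherwise `encode('latin1')` raises a UnicodeEncodeError.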