Disclaimer: I've already done a lot of research to solve this on my own, but most of the questions I found here concern Python 2.7 or don't solve my problem.
Let's say I have the following (this example comes from the BeautifulSoup docs; I'm trying to solve a bigger issue):
>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(markup)
<h1>SacrÃ© bleu!</h1>
To me, markup should be a bytes object, so that I could do:
>>> markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(str(markup, 'utf-8'))
<h1>Sacré bleu!</h1>
Yeah! But how do I make the transition from "<h1>Sacr\xc3\xa9 bleu!</h1>", which is wrong, to b"<h1>Sacr\xc3\xa9 bleu!</h1>"?
Because if I do:
>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> bytes(markup, "utf-8")
b'<h1>Sacr\xc3\x83\xc2\xa9 bleu!</h1>'
You see? It inserted \x83\xc2 for free.
>>> print(bytes(markup))
TypeError: string argument without an encoding
If you have the Unicode string "<h1>Sacr\xc3\xa9 bleu!</h1>", something has already gone wrong. Either your input is broken, or you did something wrong when processing it. For example, here, you've copied a Python 2 example into a Python 3 interpreter.
If you have your broken string because you did something wrong to get it, then you should really fix whatever it was you did wrong. If you need to convert "<h1>Sacr\xc3\xa9 bleu!</h1>" to b"<h1>Sacr\xc3\xa9 bleu!</h1>" anyway, then encode it in latin-1:
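Here is a minimal sketch of that conversion. It assumes every character in the string has a code point below 256; otherwise encode() raises UnicodeEncodeError. The names raw, fixed and doubled are only illustrative. The last step shows where the extra \x83\xc2 from the question comes from.

```python
# The mis-decoded text: each character's code point is really one byte value.
markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"

# latin-1 maps code points 0-255 directly to bytes 0-255,
# so encoding with it recovers the original raw bytes unchanged.
raw = markup.encode("latin-1")

# Those bytes are valid UTF-8, so decoding them gives the intended text
# (the h1 element containing "Sacré bleu!").
fixed = raw.decode("utf-8")

# Encoding as UTF-8 instead treats "\xc3" and "\xa9" as real characters
# (U+00C3 and U+00A9) and writes each one out as two bytes.
doubled = markup.encode("utf-8")

print(raw)
print(doubled)
```

If the string might contain characters outside that range, it was probably not produced by this kind of mis-decoding, and it is safer to go back to whatever produced it than to force the conversion.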