Python: Convert Raw String to Bytes String without

Posted 2019-06-06 04:31

Question:

I have a string:

'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

And I want:

b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

But I keep getting:

b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'

Context

I scraped a string off of a webpage and stored it in the variable un. Now I want to decompress it using BZip2:

bz2.decompress(un)

However, since un is a str object, I get this error:

TypeError: a bytes-like object is required, not 'str'

Therefore, I need to convert un to a bytes-like object without changing the single backslash to an escaped backslash.

Edit 1: Thank you for all the help! @wim I understand what you mean now, but I am at a loss as to how I can retrieve a bytes-like object from my webscraping method:

import re

import requests
from lxml import html

r = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')

doc = html.fromstring(r.content)
comment = doc.xpath('//comment()')[0].text.split('\n')[1:3]

pattern = re.compile("[a-z]{2}: '(.+)'")

un = re.search(pattern, comment[0]).group(1)

The packages that I am using are requests, lxml.html, re, and bz2.

Once again, my goal is to decompress un using bz2, but I am having difficulty getting a bytes-like object from my webscraping process.

Any pointers?

Answer 1:

Your bug occurs earlier, in the scraping code. The only acceptable solution is to change the scraping code so that it returns a bytes object rather than a text object. Do not try to "convert" your string un into bytes; it cannot be done reliably.

Do NOT do this:

>>> import bz2
>>> un = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>> bz2.decompress(un.encode('raw_unicode_escape'))
b'huge'

The "raw_unicode_escape" codec is essentially Latin-1 with a built-in fallback for characters outside that range: it writes every other code point as \uXXXX or \UXXXXXXXX. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol. If the string contains any character that cannot be represented as a single \xXX byte, your data will be corrupted.
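To make the fallback concrete, here is a minimal sketch of how the codec treats different code points. The sample characters (a euro sign and an emoji) are illustrations of my own, not taken from the question.

```python
# Code points up to U+00FF map to one byte each, exactly as Latin-1 would.
latin_range = "\xaf\x82".encode("raw_unicode_escape")
same_as_latin1 = "\xaf\x82".encode("latin-1")

# Code points above U+00FF are written out as ASCII escape text instead,
# so one character turns into six (or ten) bytes.
euro = "\u20ac".encode("raw_unicode_escape")
emoji = "\U0001f600".encode("raw_unicode_escape")

# A backslash already in the string is passed through unchanged.
backslash = "\\".encode("raw_unicode_escape")

print(latin_range)   # b'\xaf\x82'
print(euro)          # b'\\u20ac'
print(emoji)         # b'\\U0001f600'
print(backslash)     # b'\\'
```

The second and third results are where the corruption comes from: the bytes no longer correspond one-to-one with the original binary data.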

The web scraping code has no business returning bz2-compressed bytes as a str, so that is where you need to address the cause of the problem, rather than attempting to deal with the symptoms.
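One way this could look, as a sketch only. It assumes the HTML comment holds the payload as printable escape text (a literal backslash followed by xaf, and so on), which the doubled backslashes in the asker's output suggest but which nothing in the thread confirms. PAGE is a hypothetical stand-in for r.content, and decode_escaped is a helper name introduced here for illustration.

```python
import ast
import bz2
import re

# Hypothetical stand-in for r.content (the raw page bytes). The raw literal
# keeps every backslash as a literal character, as it would appear in HTML.
PAGE = rb"""<!--
un: 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
-->"""

# A bytes pattern applied to the raw response keeps the match as bytes.
PATTERN = re.compile(rb"[a-z]{2}: '(.+)'")


def decode_escaped(escaped: bytes) -> bytes:
    """Turn ASCII escape text into the bytes it spells out.

    Assumes the text is pure ASCII and contains no unescaped single quote.
    """
    # Wrap the text in b'...' and let Python parse it as a bytes literal.
    return ast.literal_eval("b'" + escaped.decode("ascii") + "'")


escaped = PATTERN.search(PAGE).group(1)
un = decode_escaped(escaped)
print(bz2.decompress(un))  # b'huge'
```

ast.literal_eval only evaluates literals, so it does not execute arbitrary code from the page. It will still raise an exception if the text is not a valid bytes literal, so real scraping code would want to handle that case.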



Answer 2:

If I understand your goal correctly, this can be achieved by:

word = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

my_byte_array = word.encode()

print(my_byte_array)

The result came out to be:

b'BZh91AY&SYA\xc2\xaf\xc2\x82\r\x00\x00\x01\x01\xc2\x80\x02\xc3\x80\x02\x00 \x00!\xc2\x9ah3M\x07<]\xc3\x89\x14\xc3\xa1BA\x06\xc2\xbe\x084'

If this isn't enough, there is a good discussion about this in this SO post. It covers the recommended ways (according to PEP) to encode UTF-8 strings to byte arrays, along with the other methods the class provides.
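For comparison with the bytes the asker wants, here is a short sketch of what str.encode() does by default. It uses UTF-8, so every character from U+0080 to U+00FF becomes two bytes, which is where the extra \xc2 and \xc3 bytes in the result above come from. The Latin-1 line is my addition, shown only for contrast.

```python
ch = "\xaf"

utf8 = ch.encode()             # the default codec is UTF-8
latin1 = ch.encode("latin-1")  # one byte per character up to U+00FF

print(utf8)    # b'\xc2\xaf'
print(latin1)  # b'\xaf'

# The encoded length grows by one byte per character in U+0080..U+00FF.
word = "A\xaf\x82\r"
print(len(word), len(word.encode()))  # 4 6
```

Because the byte sequence changes, the UTF-8 result is no longer the original bz2 stream.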