I have a string:
'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
And I want:
b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
But I keep getting:
b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'
Context
I scraped a string off of a webpage and stored it in the variable un. Now I want to decompress it using BZip2:
bz2.decompress(un)
However, since un is a str object, I get this error:
TypeError: a bytes-like object is required, not 'str'
Therefore, I need to convert un to a bytes-like object without changing the single backslash to an escaped backslash.
Edit 1:
Thank you for all the help!
@wim I understand what you mean now, but I am at a loss as to how I can retrieve a bytes-like object from my webscraping method:
import re
import requests
from lxml import html

r = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')
doc = html.fromstring(r.content)
comment = doc.xpath('//comment()')[0].text.split('\n')[1:3]
pattern = re.compile("[a-z]{2}: '(.+)'")
un = re.search(pattern, comment[0]).group(1)
The packages that I am using are requests, lxml.html, re, and bz2.
Once again, my goal is to decompress un using bz2, but I am having difficulty getting a bytes-like object from my webscraping process.
Any pointers?
Your bug exists earlier. The only acceptable solution is to change the scraping code so that it returns a bytes object and not a text object. Do not try to "convert" your string un into bytes; it cannot be done reliably.
Do NOT do this:
>>> un = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>> bz2.decompress(un.encode('raw_unicode_escape'))
b'huge'
The "raw_unicode_escape" is just a Latin-1 encoding which has a built-in fallback for characters outside of it. This encoding uses \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol. For Unicode characters that cannot be represented as a \xXX sequence, your data will become corrupted.
The web scraping code has no business returning bz2-compressed bytes as a str, so that is where you need to address the cause of the problem, rather than attempting to deal with the symptoms.
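Here is a minimal sketch of that approach, reusing the URL and comment format from the question: match against r.content (bytes) with a bytes pattern, so the captured group is already a bytes object and nothing ever has to be "converted back" from str. It assumes the compressed payload sits in the page as raw bytes; if the comment actually spells the bytes out as literal backslash escapes, those escapes would still have to be interpreted before decompressing.

import bz2
import re
import requests

r = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')

# r.content is bytes (r.text would be str); a bytes pattern keeps the match as bytes
match = re.search(rb"un: '(.+)'", r.content)
if match:
    un = match.group(1)          # bytes, never decoded to str
    data = bz2.decompress(un)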
If I understand your goal correctly, this can be achieved by:
word = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
my_byte_array = word.encode()
print(my_byte_array)
The result came out to be:
b'BZh91AY&SYA\xc2\xaf\xc2\x82\r\x00\x00\x01\x01\xc2\x80\x02\xc3\x80\x02\x00 \x00!\xc2\x9ah3M\x07<]\xc3\x89\x14\xc3\xa1BA\x06\xc2\xbe\x084'
There is a good discussion about this in this SO post if this isn't enough. It covers the recommended (per PEP) ways to encode strings to UTF-8 byte arrays and the other encoding methods the str class provides.