I would like Scrapy to not URL encode my Requests. I see that scrapy.http.Request is importing scrapy.utils.url which imports w3lib.url which contains the variable _ALWAYS_SAFE_BYTES. I just need to add a set of characters to _ALWAYS_SAFE_BYTES but I am not sure how to do that from within my spider class.
scrapy.http.Request relevant line:
fp.update(canonicalize_url(request.url))
canonicalize_url is from scrapy.utils.url, relevant line in scrapy.utils.url:
path = safe_url_string(_unquotepath(path)) or '/'
safe_url_string() is from w3lib.url, relevant lines in w3lib.url:
_ALWAYS_SAFE_BYTES = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-')
within w3lib.url.safe_url_string():
_safe_chars = _ALWAYS_SAFE_BYTES + b'%' + _reserved + _unreserved_marks
return moves.urllib.parse.quote(s, _safe_chars)
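To make the role of the safe-character set concrete, here is a minimal sketch that calls the standard library's urllib.parse.quote directly (the function that moves.urllib.parse.quote resolves to on Python 3). The path and the safe sets are made up for illustration and are not the exact values w3lib builds.

```python
from urllib.parse import quote

# Hypothetical path containing square brackets.
path = "/items[0]/name"

# With only "/" and "%" marked as safe, the brackets get percent-encoded.
encoded = quote(path, safe="/%")

# Adding "[" and "]" to the safe set leaves them untouched.
unencoded = quote(path, safe="/%[]")

print(encoded)    # /items%5B0%5D/name
print(unencoded)  # /items[0]/name
```

This is the effect that extending _ALWAYS_SAFE_BYTES would have inside safe_url_string().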
I did not want [ and ] to be encoded, and this is what I did.

When creating a Request object, Scrapy applies some URL-encoding methods. To revert these, you can use a custom middleware and change the URL to suit your needs.

You could use a Downloader Middleware like this:

Don't forget to "activate" the middleware in settings.py like so:

My project is named so, and in the project folder there is a file middlewares.py. You need to adjust those to your environment.

Credit goes to: Frank Martin
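The downloader middleware and the settings entry described above could be sketched roughly as follows. This is an illustration under assumptions, not the original code: the class name UnquoteBracketsMiddleware and the priority 900 are hypothetical, and the sketch relies on Scrapy's Request keeping its already-encoded URL in the private attribute _url, which may differ between Scrapy versions.

```python
class UnquoteBracketsMiddleware:
    """Hypothetical downloader middleware that restores literal [ and ]."""

    def process_request(self, request, spider):
        # Assumption: the encoded URL lives in the private attribute `_url`.
        # Writing to it directly avoids going through Scrapy's URL encoding again.
        request._url = request.url.replace("%5B", "[").replace("%5D", "]")
        # Returning None lets Scrapy continue handling the request as usual.
        return None


# This dict belongs in settings.py. The dotted path assumes a project named
# "so" with the class defined in middlewares.py; 900 is an assumed priority.
DOWNLOADER_MIDDLEWARES = {
    "so.middlewares.UnquoteBracketsMiddleware": 900,
}
```

Because the replacement happens in process_request, it only affects the URL that is actually downloaded. Other components that read the URL earlier may still see the encoded form.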