How can I get the request URL in Scrapy's parse() function? I have many URLs in start_urls, and some of them redirect my spider to the homepage, which leaves me with an empty item. So I need something like item['start_url'] = request.url to store those original URLs. I'm using BaseSpider.
There is no need to store the requested URLs yourself, and note that Scrapy does not necessarily process URLs in the same order as they appear in start_urls. The redirect_urls key in response.meta gives you the list of URLs the request passed through, e.g. ['http://requested_url', 'https://redirected_url', 'https://final_redirected_url']. To get the originally requested URL, take the first element of that list. As the RedirectMiddleware documentation on doc.scrapy.org puts it: "The urls which the request goes through (while being redirected) can be found in the redirect_urls Request.meta key." Hope this helps you.
The response object that is passed to parse() already has the information you want; you shouldn't need to override anything. For example:
You need to override BaseSpider's make_requests_from_url(url) method to record the start URL, and then use the Request.meta special keys to pass it along to the parse function. Hope that helps.
Python 3.5, Scrapy 1.5.0: the request object is accessible from the response object, so you can do the following: