I'm trying to update a database full of links to external websites. For some reason, Scrapy skips the callback when the requested page has moved and the server answers with a 301 redirect.
def start_requests(self):
    # ... database stuff
    for x in xrange(0, numrows):
        row = cur.fetchone()
        item = exampleItem()
        item['real_id'] = row[0]
        item['product_id'] = row[1]
        url = "http://www.example.com/a/-" + item['real_id'] + ".htm"
        # This log line shows the right URL.
        log.msg("item %d request URL is %s" % (item['product_id'], url), log.INFO)
        request = scrapy.Request(url, callback=self.parse_url)
        request.meta['item'] = item
        yield request

def parse_url(self, response):
    item = response.meta['item']
    item['real_url'] = response.url
    # This log line never shows up for the items that were redirected.
    log.msg("item %d new URL is %s" % (item['product_id'], item['real_url']), log.INFO)
The Scrapy version is 0.24. What can I do?
Interesting fact: it only happens with some of the broken links, even when they are on the same website and their URLs follow exactly the same pattern.
Had to pass the
parameter to the Response callback function