I'm struggling with Scrapy and I don't understand how exactly passing items between callbacks works. Maybe somebody could help me.
I'm looking at this example from the docs:
I'm looking into http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
I'm trying to understand the flow of actions there, step by step:
[parse_page1]
item = MyItem()
<- an item object is created
item['main_url'] = response.url
<- we are assigning a value to the main_url field of the item object
request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2)
<- we are requesting a new page and launching parse_page2 to scrape it.
[parse_page2]
item = response.meta['item']
<- I don't understand this part. Are we creating a new item object, or is this the item object created in [parse_page1]? And what does response.meta['item'] mean? In the request in step 3 we only passed the link and the callback; we didn't add any additional arguments we could refer to ...
item['other_url'] = response.url
<- we are assigning a value to the other_url field of the item object
return item
<- we are returning the item object as the result of the request
[parse_page1]
request.meta['item'] = item
<- We are assigning the item object to the request? But the request is already finished; the callback already returned the item in step 6 ????
return request
<- we are getting the results of the request, so the item from step 6, am I right?
I went through all the documentation concerning Scrapy and request/response/meta, but I still don't understand what is happening here in points 4 and 7.
line 4: request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
line 5: request.meta['item'] = item
line 6: return request
You are confused about the previous code, so let me explain it (I numbered the lines above so I can refer to them here):
In line 4, you are instantiating a scrapy.Request object. This doesn't work like other request libraries: here you are not calling the url yet, and you are not going to the callback function just yet.
You are adding arguments to the scrapy.Request object in line 5, so for example you could also have declared the scrapy.Request object like:

request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2, meta={'item': item})

and you could have avoided line 5.
It is in line 6 that you are returning the scrapy.Request object, and that is when scrapy makes it work: calling the specified url, going to the following callback, and passing meta along with it. You could also have avoided line 6 (and line 5) if you had returned the request like this:

return scrapy.Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2, meta={'item': item})
So the idea here is that your callback methods should return (preferably yield) a Request or an Item; scrapy will output the Item and continue crawling the Request.
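To make that dispatch loop concrete, here is a minimal plain-Python sketch (the Request/Response classes and the run() "engine" are stand-ins invented for this illustration, not Scrapy's real internals): a callback that returns a Request causes another download and another callback, while a callback that returns an item ends the chain.

```python
# Toy stand-ins for Scrapy's classes, just to show the control flow.
class Request:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta if meta is not None else {}

class Response:
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta  # Scrapy exposes request.meta on the response

def parse_page1(response):
    item = {'main_url': response.url}
    # Returning a Request tells the engine: download this url,
    # then call parse_page2 with its response.
    return Request("http://www.example.com/some_page.html",
                   callback=parse_page2, meta={'item': item})

def parse_page2(response):
    item = response.meta['item']      # the dict stored in parse_page1
    item['other_url'] = response.url
    return item                       # returning an item ends this chain

def run(start_request):
    """Minimal fake engine: keep 'downloading' and calling callbacks
    until a callback returns an item instead of another Request."""
    result = start_request
    while isinstance(result, Request):
        response = Response(result)   # pretend the download happened
        result = result.callback(response)
    return result

item = run(Request("http://www.example.com.html", callback=parse_page1))
print(item)
# {'main_url': 'http://www.example.com.html',
#  'other_url': 'http://www.example.com/some_page.html'}
```

The real engine does this asynchronously and accepts generators (hence the preference for yield, which lets one callback emit several Requests and Items), but the request -> response -> callback -> next request loop is the same shape.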
@eLRuLL's answer is wonderful. I want to add the part about how the item travels. First, we should be clear that a callback function only runs once the response for its request has been downloaded.
In the code the Scrapy docs give, the url and request of page1 are not declared. Let's set the url of page1 as "http://www.example.com.html".
[parse_page1] is the callback of

scrapy.Request("http://www.example.com.html", callback=parse_page1)

[parse_page2] is the callback of

scrapy.Request("http://www.example.com/some_page.html", callback=parse_page2)
When the response of page1 has been downloaded, parse_page1 is called to generate the request for page2:
item['main_url'] = response.url  # save "http://www.example.com.html" into the item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item      # store the item in request.meta
After the response of page2 has been downloaded, parse_page2 is called to return an item:
item = response.meta['item']      # response.meta is the request's meta, so here item['main_url'] = "http://www.example.com.html"
item['other_url'] = response.url  # response.url = "http://www.example.com/some_page.html"
return item                       # finally, we get an item recording the urls of page1 and page2
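This also answers the original question "are we creating a new object, or is this the object from [parse_page1]?": meta is an ordinary dict, and the response carries the request's meta unchanged, so parse_page2 gets the very same item back, not a copy. A plain-Python sketch of just that point (the Request/Response classes here are invented stand-ins, not Scrapy's own):

```python
class Request:
    def __init__(self, url):
        self.url = url
        self.meta = {}

class Response:
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta   # the same dict object, no copying

original = {'main_url': 'http://www.example.com.html'}

# parse_page1 side: request.meta['item'] = item
request = Request('http://www.example.com/some_page.html')
request.meta['item'] = original

# parse_page2 side: item = response.meta['item']
response = Response(request)
retrieved = response.meta['item']

print(retrieved is original)            # True: same object, not a new one
retrieved['other_url'] = response.url   # mutating it also changes `original`
```

Because it is the same object, anything parse_page2 writes into the item is visible through every other reference to it.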