I am new to Python and Scrapy. I have not used callback functions before, but I need them for the code below. The first request is executed, and its response is sent to the callback function given as the second argument:
    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        request = Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)
        request.meta['item'] = item
        return request

    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        return item
I am unable to understand the following things:
- How is the `item` populated?
- Does the `request.meta` line execute before the `response.meta` line in `parse_page2`?
- Where does the `item` returned from `parse_page2` go?
- What is the need for the `return request` statement in `parse_page1`? I thought the extracted items needed to be returned from here.
`parse_page1` and avoid the extra HTTP request call

In "scrapy: understanding how do items and requests work between callbacks", eLRuLL's answer is wonderful.
I want to add the part about how the item is transferred. First, to be clear: a callback function only runs once the response for its request has been downloaded.
The code given in the Scrapy docs does not declare the URL or the request for page1. Let's set the URL of page1 to "http://www.example.com.html".
[parse_page1] is the callback of
[parse_page2] is the callback of
When the response of page1 is downloaded, parse_page1 is called to generate the request for page2:
After the response of page2 is downloaded, parse_page2 is called to return an item:
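The hand-off described above can be mimicked without Scrapy. The following is only a minimal sketch: `Request` and `Response` here are toy stand-ins written for this example (not Scrapy's real classes), a plain dict stands in for `MyItem`, and the `events` list is a hypothetical helper that records which `meta` line runs first. The one behaviour it borrows from Scrapy is that `response.meta` refers to the `meta` of the request that produced the response.

```python
# Toy stand-ins for illustration only; these are NOT Scrapy's real classes.
class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback
        self.meta = {}  # free-form dict that travels with the request


class Response:
    def __init__(self, request):
        self.url = request.url
        self.request = request

    @property
    def meta(self):
        # response.meta is the meta dict of the originating request
        return self.request.meta


events = []  # hypothetical helper: records the order of the two meta lines


def parse_page1(response):
    item = {"main_url": response.url}  # plain dict standing in for MyItem
    request = Request("http://www.example.com/some_page.html",
                      callback=parse_page2)
    request.meta["item"] = item
    events.append("request.meta set in parse_page1")
    return request


def parse_page2(response):
    item = response.meta["item"]
    events.append("response.meta read in parse_page2")
    item["other_url"] = response.url
    return item


# Page 1 is "downloaded", so its callback runs and builds the page 2 request.
page1 = Response(Request("http://www.example.com.html", callback=parse_page1))
request2 = page1.request.callback(page1)

# Only after page 2 is "downloaded" does the second callback run.
page2 = Response(request2)
item = request2.callback(page2)

print(events)
print(item)
```

In this sketch the `request.meta` assignment always happens first, because `parse_page2` cannot run until the request built in `parse_page1` has been returned and its response exists.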
Read the docs:
Answers:
Spiders are managed by the Scrapy engine. It first makes requests from the URLs specified in `start_urls` and passes them to a downloader. When downloading finishes, the callback specified in the request is called. If the callback returns another request, the same thing is repeated. If the callback returns an `Item`, the item is passed to a pipeline to save the scraped data.

As stated in the docs, each callback (both `parse_page1` and `parse_page2`) can return either a `Request` or an `Item` (or an iterable of them). `parse_page1` returns a `Request`, not the `Item`, because additional info needs to be scraped from an additional URL. The second callback, `parse_page2`, returns an item because all the info has been scraped and is ready to be passed to a pipeline.
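The cycle described in this answer can be sketched as a small loop. This is a toy model under stated assumptions, not how the real engine is implemented: `run_engine`, the `scheduler` queue, and the `pipeline` list are hypothetical names, "downloading" is faked by wrapping the request in a response, and plain dicts stand in for items.

```python
from collections import deque


class Request:  # toy stand-in, not scrapy.Request
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback
        self.meta = {}


class Response:  # toy stand-in, not Scrapy's Response
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta


def run_engine(start_requests):
    """Process requests until none are left; return the collected items."""
    scheduler = deque(start_requests)
    pipeline = []  # stands in for the item pipeline
    while scheduler:
        request = scheduler.popleft()
        response = Response(request)         # pretend the page was downloaded
        result = request.callback(response)  # call the request's callback
        # A callback may return a single object or an iterable of them.
        if result is None:
            outputs = []
        elif isinstance(result, (Request, dict)):
            outputs = [result]
        else:
            outputs = list(result)
        for output in outputs:
            if isinstance(output, Request):
                scheduler.append(output)  # another request: repeat the cycle
            else:
                pipeline.append(output)   # an item: hand it to the pipeline
    return pipeline


def parse_page1(response):
    item = {"main_url": response.url}
    request = Request("http://www.example.com/some_page.html",
                      callback=parse_page2)
    request.meta["item"] = item
    return request  # not finished yet, so ask for one more page


def parse_page2(response):
    item = response.meta["item"]
    item["other_url"] = response.url
    return item  # finished, so the item goes to the pipeline


items = run_engine([Request("http://www.example.com.html", callback=parse_page1)])
print(items)
```

The loop never inspects which callback it is running. It only looks at what comes back, which is why `parse_page1` can return a request and `parse_page2` an item without either needing to know about the other.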