Understanding callbacks in Scrapy

I am new to Python and Scrapy. I have not used callback functions before. However, I do now for the code below. The first request will be executed and the response of that will be sent to the callback function defined as second argument:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

I am unable to understand following things:

How is the item populated?
Does the request.meta line executes before the response.meta line in parse_page2?
Where is the returned item from parse_page2 going?
What is the need of the return request statement in parse_page1? I thought the extracted items need to be returned from here.

标签： python callback scrapy

3条回答

来，给爷笑一个

2楼-- · 2019-03-08 09:26

yes, scrapy uses a twisted reactor to call spider functions, hence using a single loop with a single thread ensures that
the spider function caller expects to either get item/s or request/s in return, requests are put in a queue for future processing and items are sent to configured pipelines
saving an item (or any other data) in request meta makes sense only if it is needed for further processing upon getting a response, otherwise it is obviously better to simply return it from parse_page1 and avoid the extra http request call

0人赞添加讨论(0) 举报

神经病院院长

3楼-- · 2019-03-08 09:43

in scrapy: understanding how do items and requests work between callbacks ,eLRuLL's answer is wonderful.

I want to add the part of item transform. First, we shall be clear that callback function only work until the response of this request dwonloaded.

in the code the scrapy.doc given,it don't declare the url and request of page1 and. Let's set the url of page1 as "http://www.example.com.html".

[parse_page1] is the callback of

scrapy.Request("http://www.example.com.html",callback=parse_page1)`

[parse_page2] is the callback of

scrapy.Request("http://www.example.com/some_page.html",callback=parse_page2)

when the response of page1 is downloaded, parse_page1 is called to generate the request of page2:

item['main_url'] = response.url # send "http://www.example.com.html" to item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item  # store item in request.meta

after the response of page2 is downloaded, the parse_page2 is called to retrun a item:

item = response.meta['item'] 
#response.meta is equal to request.meta,so here item['main_url'] 
#="http://www.example.com.html".

item['other_url'] = response.url # response.url ="http://www.example.com/some_page.html"

return item #finally,we get the item recording  urls of page1 and page2.

0人赞添加讨论(0) 举报

聊天终结者

4楼-- · 2019-03-08 09:44

Read the docs:

For spiders, the scraping cycle goes through something like this:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.

In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.

Answers:

How is the 'item' populated does the request.meta line executes before response.meta line in parse_page2?

Spiders are managed by Scrapy engine. It first makes requests from URLs specified in start_urls and passes them to a downloader. When downloading finishes callback specified in the request is called. If the callback returns another request, the same thing is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.

Where is the returned item from parse_page2 going?

What is the need of return request statement in parse_page1? I thought the extracted items need to be returned from here ?

As stated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request not the Item, because additional info needs to be scraped from additional URL. Second callback parse_page2 returns an item, because all the info is scraped and ready to be passed to a pipeline.

0人赞添加讨论(0) 举报

Understanding callbacks in Scrapy

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间