scrapy: understanding how items and requests work

Published 2020-06-23 08:21

Question:

I'm struggling with Scrapy and I don't understand exactly how passing items between callbacks works. Maybe somebody could help me.

I'm looking into http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

I'm trying to understand the flow of actions there, step by step:

[parse_page1]

  1. item = MyItem() <- the item object is created
  2. item['main_url'] = response.url <- we assign a value to main_url of the item object
  3. request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2) <- we request a new page and launch parse_page2 to scrape it.

[parse_page2]

  4. item = response.meta['item'] <- this is the part I don't understand. Are we creating a new item object here, or is this the item object created in [parse_page1]? And what does response.meta['item'] mean? In step 3 we passed only the link and the callback to the request; we didn't add any additional arguments that we could refer to ...
  5. item['other_url'] = response.url <- we assign a value to other_url of the item object
  6. return item <- we return the item object as the result of the request

[parse_page1]

  7. request.meta['item'] = item <- We are assigning the item object to the request? But the request is finished, the callback already returned the item in step 6 ????
  8. return request <- we get the result of the request, so the item from step 6, am I right?

I went through all the documentation concerning Scrapy and request/response/meta, but I still don't understand what is happening in steps 4 and 7.

Answer 1:

line 4: request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
line 5: request.meta['item'] = item
line 6: return request

You are confused about the previous code, so let me explain it (I numbered the lines to refer to them here):

  1. In line 4 you are instantiating a scrapy.Request object. This doesn't work like other request libraries: here you are not calling the url, and not going to the callback function just yet.

  2. In line 5 you are adding arguments to the scrapy.Request object, so, for example, you could also declare the scrapy.Request object like:

    request = scrapy.Request("http://www.example.com/some_page.html",
            callback=self.parse_page2, meta={'item': item})
    

    and you could have avoided line 5.

  3. It is in line 6, when you return the scrapy.Request object, that Scrapy makes it work: calling the url specified, going to the following callback, and passing meta along with it. You could have also avoided line 6 (and line 5) if you had written the request like this:

    return scrapy.Request("http://www.example.com/some_page.html",
            callback=self.parse_page2, meta={'item': item})
    

So the idea here is that your callback methods should return (preferably yield) either a Request or an Item: Scrapy will output the Item and continue crawling the Request.
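That dispatch loop can be sketched without a running crawler. The following is a minimal, hypothetical simulation in plain Python (the Request/Response classes are stand-ins for illustration, not the real Scrapy API): the toy engine "downloads" each request, calls its callback, schedules anything that is a Request, and collects everything else as a scraped item.

```python
# Stand-in classes mirroring the relevant bits of scrapy.Request / Response.
class Request:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta if meta is not None else {}

class Response:
    def __init__(self, url, request):
        self.url = url
        self.request = request

    @property
    def meta(self):
        # response.meta is just a shortcut to the originating request's meta
        return self.request.meta

def parse_page1(response):
    item = {'main_url': response.url}
    request = Request("http://www.example.com/some_page.html",
                      callback=parse_page2)
    request.meta['item'] = item   # attach the item to the *next* request
    return request

def parse_page2(response):
    item = response.meta['item']  # same dict that parse_page1 attached
    item['other_url'] = response.url
    return item

def crawl(start_request):
    """Toy engine loop: 'download' each request, call its callback,
    schedule returned Requests, collect everything else as items."""
    items, pending = [], [start_request]
    while pending:
        request = pending.pop()
        response = Response(request.url, request)  # pretend download
        result = request.callback(response)
        if isinstance(result, Request):
            pending.append(result)
        else:
            items.append(result)
    return items

items = crawl(Request("http://www.example.com.html", callback=parse_page1))
print(items)
# [{'main_url': 'http://www.example.com.html', 'other_url': 'http://www.example.com/some_page.html'}]
```

Note that `request.meta['item'] = item` runs *before* the second request is ever downloaded; the callback only fires once the response arrives, which is the ordering the question is stuck on.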



Answer 2:

@eLRuLL's answer is wonderful. I want to add the part about how the item travels. First, we should be clear that a callback function is only called once the response of its request has been downloaded.

The code given in the Scrapy docs doesn't declare the url or the request for page1. Let's set the url of page1 to "http://www.example.com.html".

[parse_page1] is the callback of

scrapy.Request("http://www.example.com.html", callback=parse_page1)

[parse_page2] is the callback of

scrapy.Request("http://www.example.com/some_page.html",callback=parse_page2)

When the response of page1 is downloaded, parse_page1 is called to generate the request for page2:

item['main_url'] = response.url # store "http://www.example.com.html" in item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item  # store item in request.meta

After the response of page2 is downloaded, parse_page2 is called to return an item:

item = response.meta['item'] # response.meta is the same dict as request.meta, so here item['main_url'] == "http://www.example.com.html"

item['other_url'] = response.url # response.url == "http://www.example.com/some_page.html"

return item # finally, we get an item recording the urls of page1 and page2.
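The key point, that response.meta is just the meta dict of the request that produced the response, can be checked in isolation. Below is a hypothetical stand-in pair of classes (not the real Scrapy API, though it mirrors Scrapy's behaviour, where Response.meta is a property returning self.request.meta):

```python
class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}

class Response:
    def __init__(self, url, request):
        self.url = url
        self.request = request

    @property
    def meta(self):
        # same idea as in Scrapy: delegate to the originating request
        return self.request.meta

item = {'main_url': 'http://www.example.com.html'}
req = Request("http://www.example.com/some_page.html", meta={'item': item})
resp = Response(req.url, req)

# It is the very same dict object, not a copy:
assert resp.meta['item'] is item

resp.meta['item']['other_url'] = resp.url
print(item)
# {'main_url': 'http://www.example.com.html', 'other_url': 'http://www.example.com/some_page.html'}
```

Because no copy is made, mutating the item inside parse_page2 is mutating the exact object that parse_page1 created.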


Tags: python scrapy