Scrapy: Follow link to get additional Item data?

Posted 2019-01-07 07:48

I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework:

The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?

Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...

BUT the Details themselves aren't in the table -- but rather, a link to the page containing the details (if that doesn't make sense here's a table):

|-------------------------------------------------|
|             Title              |    Due Date    |
|-------------------------------------------------|
| Job Title (Clickable Link)     |    1/1/2012    |
| Other Job (Link)               |    3/2/2012    |
|--------------------------------|----------------|

I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.

3 Answers

神经病院院长 · answered 2019-01-07 08:16

Please read the docs first so the following makes sense.

The answer:

To scrape additional fields that live on another page: in your parse method, extract the URL of the page with the additional info, create a Request for that URL with a callback that parses it, pass the data you have already extracted via the request's meta parameter, and return the request from the parse method.
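Applied to the table in the question, that flow looks roughly like the sketch below. The start URL, CSS selectors, and field names are placeholders I've assumed, not taken from any real site; adjust them to the actual page.

import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    # Hypothetical listing page with the Title / Due Date table.
    start_urls = ["http://www.example.com/jobs"]

    def parse(self, response):
        for row in response.css("table tr"):
            item = {
                "title": row.css("td a::text").get(),
                "due_date": row.css("td:nth-child(2)::text").get(),
            }
            detail_url = row.css("td a::attr(href)").get()
            if detail_url:
                # Follow the link and carry the half-filled item along in meta.
                yield response.follow(detail_url,
                                      callback=self.parse_details,
                                      meta={"item": item})

    def parse_details(self, response):
        # Pull the item back out of meta and add the field from the detail page.
        item = response.meta["item"]
        item["details"] = response.css("#details::text").get()  # assumed selector
        yield item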

See also: How do I merge results from target page to current page in scrapy?

chillily · answered 2019-01-07 08:19

An example from the Scrapy documentation:

import scrapy

# Inside your Spider subclass; MyItem comes from your items.py.

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    # Stash the half-filled item on the request so the callback can reach it.
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    # Retrieve the item and finish filling it in.
    item = response.meta['item']
    item['other_url'] = response.url
    return item
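Worth noting: since Scrapy 1.7 there is also cb_kwargs, which the docs now recommend over meta for user data, because the values arrive as real keyword arguments of the callback. A sketch of the same example, using the same assumed URL and MyItem:

import scrapy

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # Each cb_kwargs entry becomes a keyword argument of the callback.
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2,
                          cb_kwargs={'item': item})

def parse_page2(self, response, item):
    item['other_url'] = response.url
    return item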
叛逆 · answered 2019-01-07 08:24

You can also use Python's functools.partial to pass an item, or any other data, to the next Scrapy callback via extra positional arguments.

Something like:

import functools
import scrapy

# Inside your Spider class:

def parse(self, response):
    # ...
    # Process the first response here, populating item,
    # someotherarg, and next_url.
    # ...
    # Bind the extra arguments up front; Scrapy calls the partial with
    # the response, which fills the last positional slot of parse_next.
    callback = functools.partial(self.parse_next, item, someotherarg)
    return scrapy.Request(next_url, callback=callback)

def parse_next(self, item, someotherarg, response):
    # ...
    # Process the second response here.
    # ...
    return item
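One caveat with this approach: a functools.partial object is not a plain method of your spider, so requests built this way generally can't be serialized. That breaks features that persist the request queue to disk, such as pausing and resuming a crawl with JOBDIR; if you need those, prefer meta (or cb_kwargs in Scrapy 1.7+).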