I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework:
The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?
Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...
BUT the Details themselves aren't in the table -- rather, each row links to the page containing the details (if that doesn't make sense, here's a table):
|----------------------------|------------|
| Title                      | Due Date   |
|----------------------------|------------|
| Job Title (Clickable Link) | 1/1/2012   |
| Other Job (Link)           | 3/2/2012   |
|----------------------------|------------|
I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.
Please read the docs first to understand what I'm describing.
The answer:
To scrape additional fields that live on other pages: in a parse method, extract the URL of the page with the additional info, create a Request object for that URL with a callback, pass the already-extracted data via its meta parameter, and return that Request from the parse method.
This also answers the related question: how do I merge results from the target page into the current page in Scrapy?
An example from the Scrapy documentation:

import scrapy

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    # Stash the partially filled item on the request so the next
    # callback can finish populating it.
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
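
Applied to the table in the question, a minimal spider sketch might look like the following. The listing URL, XPath selectors, and field names are all assumptions about the actual page markup, not something taken from it:

import scrapy

class JobsSpider(scrapy.Spider):
    name = 'jobs'
    # Hypothetical listing URL -- replace with the real one.
    start_urls = ['http://www.example.com/jobs']

    def parse(self, response):
        # One table row per job posting (the XPath is an assumption).
        for row in response.xpath('//table//tr[td]'):
            item = {
                'title': row.xpath('.//a/text()').get(),
                'due_date': row.xpath('./td[2]/text()').get(),
            }
            details_url = row.xpath('.//a/@href').get()
            if details_url:
                request = scrapy.Request(response.urljoin(details_url),
                                         callback=self.parse_details)
                request.meta['item'] = item
                yield request

    def parse_details(self, response):
        # Retrieve the item started in parse() and finish it.
        item = response.meta['item']
        # Assumed container for the details text.
        item['details'] = ' '.join(
            response.xpath('//div[@id="details"]//text()').getall())
        yield item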
You can also use Python's functools.partial to pass an item (or any other data) to the next Scrapy callback via additional positional arguments. Something like:
import functools
from scrapy import Request

# Inside your Spider class:

def parse(self, response):
    # ...
    # Process the first response here; populate item and next_url.
    # ...
    # Bind the extra arguments now; Scrapy will invoke the partial
    # with just the response.
    callback = functools.partial(self.parse_next, item, someotherarg)
    return Request(next_url, callback=callback)

def parse_next(self, item, someotherarg, response):
    # ...
    # Process the second response here.
    # ...
    return item
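
For completeness: Scrapy 1.7+ added a cb_kwargs argument on Request, which passes keyword arguments directly to the callback and is the documented successor to meta for this purpose. A minimal sketch, reusing the hypothetical URL from the docs example above:

import scrapy

def parse_page1(self, response):
    item = {'main_url': response.url}
    yield scrapy.Request('http://www.example.com/some_page.html',
                         callback=self.parse_page2,
                         cb_kwargs={'item': item})

def parse_page2(self, response, item):
    # Entries in cb_kwargs arrive as named arguments alongside response.
    item['other_url'] = response.url
    yield item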