I don't have a specific code issue I'm just not sure how to approach the following problem logistically with the Scrapy framework:
The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?
Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...
BUT the Details themselves aren't in the table -- but rather, a link to the page containing the details (if that doesn't make sense here's a table):
|-------------------------------------------------|
| Title | Due Date |
|-------------------------------------------------|
| Job Title (Clickable Link) | 1/1/2012 |
| Other Job (Link) | 3/2/2012 |
|--------------------------------|----------------|
I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.
Please, first read the docs to understand what i say.
The answer:
To scrape additional fields which are on other pages, in a parse method extract URL of the page with additional info, create and return from that parse method a Request object with that URL and pass already extracted data via its
meta
parameter.how do i merge results from target page to current page in scrapy?
An example from scrapy documentation
You can also use Python
functools.partial
to pass anitem
or any other serializable data via additional arguments to the next Scrapy callback.Something like: