I am currently developing a scraper with Scrapy for the first time, and I am also using `yield` for the first time. I am still trying to wrap my head around `yield`.
The Scraper:
- Scrapes one page to get a list of dates (parse)
- Uses these dates to format the URLs it then scrapes (parse_page_contents)
- On each of those pages, it finds the URLs of the individual listings and scrapes them (parse_page_listings)
- On each individual listing I want to extract all the data. There are also 4 links on each individual listing that contain even more data. (parse_individual_listings)
I am struggling to understand how to combine the JSON from parse_individual_tabs and parse_individual_listings into one JSON string. There will be one string per individual listing, and it will be sent to an API. Even just printing it for the time being would work.
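For reference, the end goal here, merging two dicts and serializing them as a single JSON string, is plain Python and needs no Scrapy at all. A minimal sketch (Python 3; the field names are made up for illustration):

```python
import json

# Hypothetical fields scraped from the listing page and one of its tabs
listing_data = {"Address": "1 Main St", "Status": "Approved"}
tab_data = {"Applicant": "J. Smith"}

# Merge the two dicts and serialize them into one JSON string per listing
combined = dict(listing_data)
combined.update(tab_data)
payload = json.dumps(combined, sort_keys=True)
print(payload)
```

The hard part in a spider is only getting both dicts into the same callback, which is what `meta` is for.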
import json
import urlparse  # Python 2 stdlib; on Python 3 use urllib.parse

import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        '',
    ]

    def parse(self, response):
        # Scrape the list of dates from the first page
        rows = response.css('table.apas_tbl tr').extract()
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            url = ""
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        # Collect the paginated result URLs for one date
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        urls = []
        urls.append(response.url)
        for link in pages.find_all('a'):
            urls.append('/'.format(link['href']))
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_page_listings)

    def parse_page_listings(self, response):
        # Find each individual listing on a results page
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        resultTable = soup.find("table", {"class": "apas_tbl"})
        for row in resultTable.find_all('a'):
            url = ""
            yield scrapy.Request(url, callback=self.parse_individual_listings)

    def parse_individual_listings(self, response):
        # Extract the listing's own fields, then follow its tab links
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        for field in fields:
            print field.label.text.strip()
            print field.p.text.strip()
        tabs = response.xpath('//div[@id="tabheader"]').extract_first()
        soup = BeautifulSoup(tabs, 'lxml')
        links = soup.find_all("a")
        for link in links:
            yield scrapy.Request(
                urlparse.urljoin(response.url, link['href']),
                callback=self.parse_individual_tabs)
To:
def parse_individual_listings(self, response):
    rows = response.xpath('//div[@id="apas_form"]').extract_first()
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div', {'id': 'fieldset_data'})
    data = {}
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    tabs = response.xpath('//div[@id="tabheader"]').extract_first()
    soup = BeautifulSoup(tabs, 'lxml')
    links = soup.find_all("a")
    for link in links:
        yield scrapy.Request(
            urlparse.urljoin(response.url, link['href']),
            callback=self.parse_individual_tabs,
            meta={'data': data}
        )
    print data
..
def parse_individual_tabs(self, response):
    data = {}
    rows = response.xpath('//div[@id="tabContent"]').extract_first()
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div', {'id': 'fieldset_data'})
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    print json.dumps(data)
To:
def parse_individual_tabs(self, response):
    data = response.meta['data']  # dict started in parse_individual_listings
    rows = response.xpath('//div[@id="tabContent"]').extract_first()
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div', {'id': 'fieldset_data'})
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    yield json.dumps(data)
Normally when collecting data you would use Scrapy Items, but they can also be replaced with plain dictionaries (which would be the JSON objects you are referring to), so we'll use those here.

First, start creating the item (or dictionary) in the parse_individual_listings method, just as you did with data in parse_individual_tabs. Then pass it to the next request (which will be caught by parse_individual_tabs) with the meta argument, as in the updated parse_individual_listings above.

Then you can pick that dictionary back up in parse_individual_tabs with response.meta['data']. Now the data in parse_individual_tabs has all the information you want from both requests, and you can use the same technique between any callback requests.
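The hand-off described above, build a dict in one callback, attach it to the next Request via meta, and finish it in the next callback, can be sketched without a live site. The Request/Response classes below are simplified stand-ins for Scrapy's (in a real spider the engine supplies them), and the field values are invented; this is Python 3:

```python
import json

class FakeRequest:
    """Stand-in for scrapy.Request: just carries url, callback and meta."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    """Stand-in for scrapy.Response: exposes the originating request's meta."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta

def parse_individual_listings(response):
    data = {"Reference": "A123"}               # fields scraped from the listing page
    yield FakeRequest("/tab1", parse_individual_tabs, meta={"data": data})

def parse_individual_tabs(response):
    data = response.meta["data"]               # dict started in the previous callback
    data.update({"Applicant": "J. Smith"})     # fields scraped from the tab page
    yield json.dumps(data)                     # one combined JSON string per listing

# Drive the two callbacks by hand, the way Scrapy's engine would
request = next(parse_individual_listings(FakeResponse(FakeRequest("/listing", None))))
result = next(request.callback(FakeResponse(request)))
print(result)
```

The only Scrapy-specific piece in the real spider is `meta={'data': data}` on the Request and `response.meta['data']` in the callback; everything else is ordinary dict handling.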