I can only presume this is one of the most basic things to do in Scrapy, but I just cannot work out how to do it. Basically, I scrape one page to get a list of urls that contain the updates for the week. I then need to go into these urls one by one and scrape the information from them. I currently have both scrapers set up and they work perfectly when run manually: I first scrape the urls with the first scraper and then hard-code them into the start_urls list of the second scraper.
What is the best way to do this? Is it as simple as calling another function in the scraper file that takes a list of urls and does the scraping there?
This is the scraper that gets the list of urls:
import scrapy
from bs4 import BeautifulSoup

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [ .....
    ]

    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        urls = []
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            urls.append("http://myurl{}.com/{}".format(dates[0]['value'], dates[1]['value']))
This is the scraper that then goes through the urls one by one:
import scrapy
from bs4 import BeautifulSoup

class Planning(scrapy.Spider):
    name = "planning"
    start_urls = [
        ...
    ]

    def parse(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        for link in pages.find_all('a'):
            url = 'myurl.com/{}'.format(link['href'])

        resultTable = soup.find("table", {"class": "apas_tbl"})
I then save resultTable to a file. At the moment I take the output of the urls list from the first scraper and copy it into the second one.
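The save step itself is nothing special, roughly along these lines (the filename here is just a placeholder):

with open('results.html', 'w', encoding='utf-8') as f:
    f.write(str(resultTable))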
For every link that you find with parse, you can request it and parse the content with another callback function:
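Here is a rough sketch of how that could look, folded into the first spider (parse_page is just a hypothetical name for a second callback that would hold the logic from Planning.parse; the url pattern is taken from your code):

import scrapy
from bs4 import BeautifulSoup

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [ .....
    ]

    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            url = "http://myurl{}.com/{}".format(dates[0]['value'], dates[1]['value'])
            # schedule a request for each url; Scrapy calls parse_page with the response
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # the logic from the Planning spider's parse goes here
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        resultTable = soup.find("table", {"class": "apas_tbl"})
        # ... extract and save resultTable as before

If the links you collect on the detail pages are relative, response.urljoin(link['href']) (or response.follow) can be used to turn them into absolute urls before requesting them.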