Scrapy - Scraping links found while scraping

Posted 2019-03-04 04:01

I can only presume this is one of the most basic things to do in Scrapy, but I just cannot work out how to do it. Basically, I scrape one page to get a list of urls that contain the updates for the week. I then need to go into these urls one by one and scrape the information from them. I currently have both scrapers set up, and they work perfectly when run manually: I first scrape the urls with the first scraper, then hard-code them into start_urls on the second one.

What is the best way to do it? Is it as simple as calling another function in the scraper file that takes a list of urls and does the scraping there?

This is the scraper that gets the list of urls:

import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "myspider"

    start_urls = [ .....
    ]

    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        urls = []
        for row in rows[1:]:  # skip the header row
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            urls.append("http://myurl{}.com/{}".format(dates[0]['value'], dates[1]['value']))

This is the scraper that then goes through the urls one by one:

import scrapy
from bs4 import BeautifulSoup


class Planning(scrapy.Spider):
    name = "planning"

    start_urls = [
       ...
    ]

    def parse(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        for link in pages.find_all('a'):
            # build the url for each link found on the page
            url = 'myurl.com/{}'.format(link['href'])

        resultTable = soup.find("table", { "class" : "apas_tbl" })

I then save resultTable to a file. At the moment I take the output of the urls list from the first scraper and copy it into the second one by hand.

Tags: python scrapy
1 Answer
Deceive · 2019-03-04 04:53

For every link that you find in parse, you can yield a Request for it and parse the content in another callback:

import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "myspider"

    start_urls = [ .....
    ]

    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        urls = []
        for row in rows[1:]:  # skip the header row
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            url = "http://myurl{}.com/{}".format(dates[0]['value'], dates[1]['value'])
            urls.append(url)
            # schedule the url; Scrapy will download it and call
            # parse_page_contents with the response
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        for link in pages.find_all('a'):
            url = 'myurl.com/{}'.format(link['href'])

        # extract the results table here and save it (or yield it as an item)
        resultTable = soup.find("table", { "class" : "apas_tbl" })
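
As a side note: if the href attributes inside apas_form_text are relative, it may be simpler to let Scrapy resolve them instead of building urls by hand. Below is a minimal sketch of that variant using response.follow (available since Scrapy 1.4) and Scrapy's own selectors in place of BeautifulSoup; the spider name and start url here are placeholders, not from the original post:

import scrapy


class PlanningFollow(scrapy.Spider):
    name = "planning_follow"                  # hypothetical name for this sketch
    start_urls = ["http://myurl.com/start"]   # placeholder start url

    def parse(self, response):
        # follow every link inside the #apas_form_text block;
        # response.follow resolves relative hrefs against response.url,
        # and Scrapy's dupefilter drops urls that were already visited
        for href in response.css('#apas_form_text a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

        # pull out the results table with Scrapy's built-in selectors
        table = response.css('table.apas_tbl').get()
        if table:
            yield {"table_html": table, "source_url": response.url}

Running this with scrapy crawl planning_follow -o tables.json would write each yielded item to a file via Scrapy's feed exports, rather than saving resultTable manually.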