Python Scrapy & Yield

2019-05-07 04:29发布

I am currently developing a scraper using Scrapy for the first time and I am using Yield for the first time as well. I am still trying to wrap my head around yield.

The Scraper:

  • Scrapes one page to get a list of dates (parse)
  • Uses these dates to format URLS to then scrape (parse_page_contents)
  • On this page, it find URLS of each individual listing and scrapes the individual listings (parse_page_listings)
  • On the individual list I want to extract all the data. There are also 4 links on each individual listing that contains even more data. (parse_individual_listings)

I am struggling to understand how to combine the JSON from parse_individual_tabs and parse_individual_listings into one JSON string. This will be one for each individual listing and will be sent to an API. Even just printing it for the time being will work.

    class MySpider(scrapy.Spider):
    name = "myspider"

    start_urls = [
            '',
    ]


    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            url = ""
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        urls = []
        urls.append(response.url)
        for link in pages.find_all('a'):
            urls.append('/'.format(link['href']))

        for url in urls:
             yield scrapy.Request(url, callback=self.parse_page_listings)

    def parse_page_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        resultTable = soup.find("table", { "class" : "apas_tbl" })

        for row in resultTable.find_all('a'):
            url = ""
            yield scrapy.Request(url, callback=self.parse_individual_listings)


    def parse_individual_listings(self, response): 
        rows = response.xpath('//div[@id="apas_form"]').extract_first() 
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div',{'id':'fieldset_data'})
        for field in fields:
            print field.label.text.strip()
            print field.p.text.strip()

        tabs = response.xpath('//div[@id="tabheader"]').extract_first() 
        soup = BeautifulSoup(tabs, 'lxml')
        links = soup.find_all("a")
       for link in links:
            yield scrapy.Request( urlparse.urljoin(response.url, link['href']), callback=self.parse_individual_tabs)

To:

def parse_individual_listings(self, response): 
    rows = response.xpath('//div[@id="apas_form"]').extract_first() 
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div',{'id':'fieldset_data'})
    data = {}
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()

    tabs = response.xpath('//div[@id="tabheader"]').extract_first() 
    soup = BeautifulSoup(tabs, 'lxml')
    links = soup.find_all("a")
    for link in links:
        yield scrapy.Request(
            urlparse.urljoin(response.url, link['href']), 
            callback=self.parse_individual_tabs,
            meta={'data': data}
        )
    print data

..

    def parse_individual_tabs(self, response): 
        data = {}
        rows = response.xpath('//div[@id="tabContent"]').extract_first() 
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div',{'id':'fieldset_data'})
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()

        print json.dumps(data)

to

def parse_individual_tabs(self, response): 
        data = {}
        rows = response.xpath('//div[@id="tabContent"]').extract_first() 
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div',{'id':'fieldset_data'})
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()

        yield json.dumps(data)

标签: python scrapy
1条回答
Lonely孤独者°
2楼-- · 2019-05-07 05:03

Normally when obtaining data, you'll have to use Scrapy Items but they can also be replaced with dictionaries (which would be the JSON objects you are referring to), so we'll use them now:

First, start creating the item (or dictionary) in the parse_individual_listings method, just as you did with data in parse_individual_tabs. Then pass it to the next request (that will be caught by parse_individual_tabs with the meta argument, so it should look like:

def parse_individual_listings(self, response):
    ...
    data = {}
    data[field1] = 'data1'
    data[field1] = 'data2'
    ...
    yield scrapy.Request(
        urlparse.urljoin(response.url, link['href']), 
        callback=self.parse_individual_tabs,
        meta={'data': data};
    )

Then, you can get that data in parse_individual_tabs:

def parse_individual_tabs(self, response):
    data = response.meta['data']
    ...
    # keep populating `data`
    yield data

Now the data in parse_individual_tabs has all the information you want from both requests, you can do the same between any callback requests.

查看更多
登录 后发表回答