Multiple nested request with scrapy

2019-02-20 03:08发布

I try to scrap some airplane schedule information on www.flightradar24.com website for research project.

The hierarchy of json file i want to obtain is something like that :

Object ID
 - country
   - link
   - name
   - airports
     - airport0 
       - code_total
       - link
       - lat
       - lon
       - name
       - schedule
          - ...
          - ...
      - airport1 
       - code_total
       - link
       - lat
       - lon
       - name
       - schedule
          - ...
          - ...

Country and Airport are stored using items, and as you can see on json file the CountryItem (link, name attribute) finally store multiple AirportItem (code_total, link, lat, lon, name, schedule) :

class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    airports = scrapy.Field()
    other_url= scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    schedule = scrapy.Field()

Here my scrapy code AirportsSpider to do that :

class AirportsSpider(scrapy.Spider):
    name = "airports"
    start_urls = ['https://www.flightradar24.com/data/airports']
    allowed_domains = ['flightradar24.com']

    def clean_html(self, html_text):
        soup = BeautifulSoup(html_text, 'html.parser')
        return soup.get_text()

    rules = [
    # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LxmlLinkExtractor(allow=('data/airports/',)), callback='parse')
    ]


    def parse(self, response):
        count_country = 0
        countries = []
        for country in response.xpath('//a[@data-country]'):
            if count_country > 5:
                break
            item = CountryItem()
            url =  country.xpath('./@href').extract()
            name = country.xpath('./@title').extract()
            item['link'] = url[0]
            item['name'] = name[0]
            count_country += 1
            countries.append(item)
            yield scrapy.Request(url[0],meta={'my_country_item':item}, callback=self.parse_airports)

    def parse_airports(self,response):
        item = response.meta['my_country_item']
        airports = []

        for airport in response.xpath('//a[@data-iata]'):
            url = airport.xpath('./@href').extract()
            iata = airport.xpath('./@data-iata').extract()
            iatabis = airport.xpath('./small/text()').extract()
            name = ''.join(airport.xpath('./text()').extract()).strip()
            lat = airport.xpath("./@data-lat").extract()
            lon = airport.xpath("./@data-lon").extract()

            iAirport = AirportItem()
            iAirport['name'] = self.clean_html(name)
            iAirport['link'] = url[0]
            iAirport['lat'] = lat[0]
            iAirport['lon'] = lon[0]
            iAirport['code_little'] = iata[0]
            iAirport['code_total'] = iatabis[0]

            airports.append(iAirport)

        for airport in airports:
            json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin\[\]=&plugin-setting\[schedule\]\[mode\]=&plugin-setting\[schedule\]\[timestamp\]={timestamp}&page=1&limit=50&token='.format(code=airport['code_little'], timestamp="1484150483")
            yield scrapy.Request(json_url, meta={'airport_item': airport}, callback=self.parse_schedule)

        item['airports'] = airports

        yield {"country" : item}

    def parse_schedule(self,response):

        item = response.request.meta['airport_item']
        jsonload = json.loads(response.body_as_unicode())
        json_expression = jmespath.compile("result.response.airport.pluginData.schedule")
        item['schedule'] = json_expression.search(jsonload)

Explanation :

  • In my first parse, i call a request on for each country link i found whith the CountryItem created via meta={'my_country_item':item}. Each of these request callback self.parse_airports

  • In my second level of parse parse_airports, i catch CountryItem created using item = response.meta['my_country_item'] and i create a new item iAirport = AirportItem() for each airport i found into this country page. Now i want to get schedule information for each AirportItem created and stored in airports list.

  • In the second level of parse parse_airports, i run a for loop on airports to catch schedule information using a new Request. Because i want to include this schedule information into my AirportItem, i include this item into meta information meta={'airport_item': airport}. The callback of this request run parse_schedule

  • In the third level of parse parse_schedule, i inject the schedule information collected by scrapy into the AirportItem previously created using response.request.meta['airport_item']

But i have a problem in my source code, scrapy correctly scrap all the informations (country, airports, schedule), but my comprehension of nested item seems not correct. As you can see the json i produced contain country > list of (airport), but not country > list of (airport > schedule )

enter image description here

My code is on github : https://github.com/IDEES-Rouen/Flight-Scrapping

1条回答
唯我独甜
2楼-- · 2019-02-20 03:39

The issue is that you fork your item, where according to your logic you only want 1 item per country, so you can't yield mutltiple items at any point after parsing the country. What you want to do is stack all of them into one item.
To do that you need to create a parsing loop:

def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()

        iAirport = dict()
        iAirport['name'] = 'foobar'
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin\[\]=&plugin-setting\[schedule\]\[mode\]=&plugin-setting\[schedule\]\[timestamp\]={timestamp}&page=1&limit=50&token='.format(
            code=airport['code_little'], timestamp="1484150483")
        urls.append(json_url)
    if not urls:
        return item

    # start with first url
    next_url = urls.pop()
    return Request(next_url, self.parse_schedule,
                   meta={'airport_item': item, 'airport_urls': urls, 'i': 0})

def parse_schedule(self, response):
    """we want to loop this continuously for every schedule item"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']

    jsonload = json.loads(response.body_as_unicode())
    item['airports'][i]['schedule'] = 'foobar'
    # now do next schedule items
    if not urls:
        yield item
        return
    url = urls.pop()
    yield Request(url, self.parse_schedule,
                  meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
查看更多
登录 后发表回答