Extract 3-level content from paginated pages with Scrapy

Posted 2019-06-08 20:54

I have a seed url (say DOMAIN/manufacturers.php) with no pagination that looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="st-text">
        <table cellspacing="6" width="600">
            <tr>
                <td>
                    <a href="manufacturer1-type-59.php"></a>
                </td>

                <td>
                    <a href="manufacturer1-type-59.php">Name 1</a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php"></a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php">Name 2</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer3-type-88.php"></a>
                </td>

                <td>
                    <a href="manufacturer3-type-88.php">Name 3</a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php"></a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php">Name 4</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer5-type-28.php"></a>
                </td>

                <td>
                    <a href="manufacturer5-type-28.php">Name 5</a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php"></a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php">Name 6</a>
                </td>
            </tr>
        </table>
    </div>
</body>
</html>

From there I would like to get all a['href'] values, for example manufacturer1-type-59.php. Note that these links do NOT contain the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?

Optionally, I would like to keep the links both in memory (for the very next phase) and also save them to disk for future reference.
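
Roughly, what I have in mind for this first level is the sketch below (parse_manufacturer, level1_links.txt and the DOMAIN placeholder in start_urls are just names I made up; response.urljoin() resolves the relative hrefs against the page URL, so the prefix should be handled for me):

import scrapy

class ManufacturersSpider(scrapy.Spider):
    name = 'manufacturers'
    start_urls = ['http://DOMAIN/manufacturers.php']  # placeholder seed URL

    def parse(self, response):
        # resolve the relative hrefs against the page URL
        hrefs = response.xpath('//div[@class="st-text"]//a/@href').extract()
        # each link appears twice in the table, so dedupe while keeping order
        links = list(dict.fromkeys(response.urljoin(h) for h in hrefs))
        # save the links to disk for future reference
        with open('level1_links.txt', 'w') as f:
            f.write('\n'.join(links))
        # ...and keep them in memory for the very next phase
        for link in links:
            yield scrapy.Request(link, callback=self.parse_manufacturer)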

The content of each of these links, such as manufacturer1-type-59.php, looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="makers">
        <ul>
            <li>
                <a href="manufacturer1_model1_type1.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model1_type2.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model2_type3.php"></a>
            </li>
        </ul>
    </div>

    <div class="nav-band">
        <div class="nav-items">
            <div class="nav-pages">
                <span>Pages:</span><strong>1</strong>
                <a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>
                <a href="manufacturer1-type-STRING-59-INT-p3.php">3</a>
                <a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a>
            </div>
        </div>
    </div>
</body>
</html>

Next, I would like to get all a['href'] values, for example manufacturer1_model1_type1.php. Again, note that these links do NOT contain the domain prefix. One additional difficulty here is that these pages are paginated, so I would like to visit all of those pages too. As shown above, the "Next page" link on manufacturer1-type-59.php leads to manufacturer1-type-STRING-59-INT-p2.php.

Optionally, I would also like to keep the links both in memory (for the very next phase) and also save them to disk for future reference.
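
Continuing the sketch, this second level would be another method on the same spider (parse_model is again a made-up name); the "Next page" link is fed back into the same callback so that every paginated page gets visited:

def parse_manufacturer(self, response):
    # model links on the current page
    for href in response.xpath('//div[@class="makers"]/ul/li/a/@href').extract():
        yield scrapy.Request(response.urljoin(href), callback=self.parse_model)
    # pagination: follow the "Next page" link, if any, re-entering this callback
    next_page = response.xpath('//a[@title="Next page"]/@href').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page),
                             callback=self.parse_manufacturer)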

The third and final step should be to retrieve the content of all pages of type manufacturer1_model1_type1.php, extract the title, and save the result to a file in the following form: (url, title, ).
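
In sketch form, this last callback could simply yield the pair; Scrapy's feed exports can then write it to a file (e.g. scrapy crawl manufacturers -o results.csv):

def parse_model(self, response):
    # final level: one (url, title) record per model page
    yield {
        'url': response.url,
        'title': response.xpath('//title/text()').extract_first(),
    }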

EDIT

This is what I have done so far, but it doesn't seem to work...

import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ArchiveItem(scrapy.Item):
    url = scrapy.Field()

class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']
    rules = [
        Rule(LinkExtractor(allow=[r'\S+-phones-\d+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+-phones-f-\d+-0-\S+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+_\S+_\S+-\d+\.php']), callback='parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent

Tags: scrapy

1 Answer

干净又极端 · answered 2019-06-08 21:54

I think you are better off using a plain Spider (the old BaseSpider) instead of CrawlSpider.

This code might help:

from scrapy import Spider, Request

class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3', ]
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        # level 1: extract the maker (manufacturer) links
        makers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        for maker in makers:
            yield Request(url=self.BASE_URL + maker, callback=self.parse_maker)

    def parse_maker(self, response):
        # level 2: extract the phone links on the current page
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        for phone in phones:
            yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)

        # pagination: follow the "Next page" link back into this callback
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_maker)

    def parse_phone(self, response):
        # level 3: extract whatever you want and yield items here
        pass
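
Once parse_phone yields items, you can run the spider and persist the results with Scrapy's built-in feed exports, e.g. scrapy crawl gsmarena -o results.csv.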

EDIT

If you want to keep track of where these phone URLs are coming from, you can pass the URL as meta from parse through parse_maker to parse_phone. The requests would then look like:

yield Request(url=self.BASE_URL + maker, callback=self.parse_maker,
              meta={'url_level1': response.url})

yield Request(url=self.BASE_URL + phone, callback=self.parse_phone,
              meta={'url_level2': response.url, 'url_level1': response.meta['url_level1']})
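
With that in place, a possible parse_phone (my own sketch, not part of the answer above) could finish the third step by yielding the title together with the parent URLs carried through meta:

def parse_phone(self, response):
    # level 3: one record per phone page, including its ancestry
    yield {
        'url': response.url,
        'title': response.xpath('//title/text()').extract_first(),
        'url_level1': response.meta.get('url_level1'),
        'url_level2': response.meta.get('url_level2'),
    }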