I have a seed URL (say DOMAIN/manufacturers.php) with no pagination that looks like this:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="st-text">
<table cellspacing="6" width="600">
<tr>
<td>
<a href="manufacturer1-type-59.php"></a>
</td>
<td>
<a href="manufacturer1-type-59.php">Name 1</a>
</td>
<td>
<a href="manufacturer2-type-5.php"></a>
</td>
<td>
<a href="manufacturer2-type-5.php">Name 2</a>
</td>
</tr>
<tr>
<td>
<a href="manufacturer3-type-88.php"></a>
</td>
<td>
<a href="manufacturer3-type-88.php">Name 3</a>
</td>
<td>
<a href="manufacturer4-type-76.php"></a>
</td>
<td>
<a href="manufacturer4-type-76.php">Name 4</a>
</td>
</tr>
<tr>
<td>
<a href="manufacturer5-type-28.php"></a>
</td>
<td>
<a href="manufacturer5-type-28.php">Name 5</a>
</td>
<td>
<a href="manufacturer6-type-48.php"></a>
</td>
<td>
<a href="manufacturer6-type-48.php">Name 6</a>
</td>
</tr>
</table>
</div>
</body>
</html>
From there I would like to get all a['href'] values, for example manufacturer1-type-59.php. Note that these links do NOT contain the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?
Optionally, I would like to keep the links both in memory (for the very next phase) and save them to disk for future reference.
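A minimal sketch of that first step using only the standard library (no Scrapy yet): collect every a['href'] from the seed page, add the missing domain prefix with urljoin, keep the list in memory, and write a copy to disk. "http://DOMAIN/" and the output filename are placeholders, not values from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

DOMAIN = "http://DOMAIN/"  # assumption: replace with the real domain


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_links(html_text, base=DOMAIN):
    parser = HrefCollector()
    parser.feed(html_text)
    # Relative links like manufacturer1-type-59.php need the domain
    # prefix; urljoin adds it (and leaves absolute URLs untouched).
    links = [urljoin(base, h) for h in parser.hrefs]
    # Deduplicate while preserving order: each manufacturer appears
    # twice in the sample table (image cell and name cell).
    return list(dict.fromkeys(links))


seed_html = '<td><a href="manufacturer1-type-59.php">Name 1</a></td>'
links = collect_links(seed_html)          # kept in memory for the next phase
with open("manufacturer_links.txt", "w") as f:  # and saved for future reference
    f.write("\n".join(links))
print(links)  # ['http://DOMAIN/manufacturer1-type-59.php']
```

So yes: the prefix has to be added, and urljoin is the usual way to do it.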
The content of each of these links, such as manufacturer1-type-59.php, looks like this:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="makers">
<ul>
<li>
<a href="manufacturer1_model1_type1.php"></a>
</li>
<li>
<a href="manufacturer1_model1_type2.php"></a>
</li>
<li>
<a href="manufacturer1_model2_type3.php"></a>
</li>
</ul>
</div>
<div class="nav-band">
<div class="nav-items">
<div class="nav-pages">
<span>Pages:</span><strong>1</strong>
<a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>
<a href="manufacturer1-type-STRING-59-INT-p3.php">3</a>
<a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a>
</div>
</div>
</div>
</body>
</html>
Next, I would like to get all a['href'] values, for example manufacturer1_model1_type1.php. Again, note that these links do NOT contain the domain prefix. One additional difficulty here is that these pages support pagination, so I would like to go into all of those pages too. As expected, manufacturer1-type-59.php redirects to manufacturer1-type-STRING-59-INT-p2.php.
Optionally, I would also like to keep the links both in memory (for the very next phase) and save them to disk for future reference.
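The second step can be sketched the same stdlib-only way: split the hrefs on a manufacturer page into model links (inside div.makers) and pagination links (inside div.nav-pages), so a crawler can queue the models and also walk every page. The div class names come from the sample HTML above; "http://DOMAIN/" is again a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageLinkParser(HTMLParser):
    """Tracks which <div> classes we are inside and buckets hrefs:
    model links under class "makers", pagination links under "nav-pages"."""

    def __init__(self):
        super().__init__()
        self.context = []      # stack of the classes of open <div>s
        self.model_links = []
        self.page_links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div":
            self.context.append(a.get("class", ""))
        elif tag == "a" and a.get("href"):
            if "makers" in self.context:
                self.model_links.append(a["href"])
            elif "nav-pages" in self.context:
                self.page_links.append(a["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.context:
            self.context.pop()


html_page = """
<div class="makers"><ul>
  <li><a href="manufacturer1_model1_type1.php"></a></li>
</ul></div>
<div class="nav-band"><div class="nav-items"><div class="nav-pages">
  <strong>1</strong>
  <a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>
</div></div></div>
"""
p = PageLinkParser()
p.feed(html_page)
base = "http://DOMAIN/"  # assumption: the real domain goes here
models = [urljoin(base, h) for h in p.model_links]
pages = [urljoin(base, h) for h in p.page_links]
```

Feeding each page in `pages` back through the same parser visits the whole pagination; deduplicating visited URLs stops the "Next page" link from looping.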
The third and final step should be to retrieve the content of all pages of type manufacturer1_model1_type1.php, extract the title, and save the result to a file in the following form: (url, title, ).
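A minimal sketch of that final step, assuming CSV is an acceptable on-disk form for the (url, title, ) rows; the filename titles.csv and the sample page are made up for illustration.

```python
import csv
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Captures the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title.strip()


# In the real crawl the HTML would come from fetching each model URL;
# here one page is inlined so the shape of the output is visible.
rows = [("http://DOMAIN/manufacturer1_model1_type1.php",
         extract_title("<html><head><title>Model 1</title></head></html>"))]
with open("titles.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)  # one (url, title) row per page
```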
EDIT
This is what I have done so far, but it doesn't seem to work...
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ArchiveItem(scrapy.Item):
    url = scrapy.Field()


class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']

    rules = [
        Rule(LinkExtractor(allow=[r'\S+-phones-\d+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+-phones-f-\d+-0-\S+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+_\S+_\S+-\d+\.php']), 'parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent
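One way to debug rules like these is to sanity-check the allow patterns offline: LinkExtractor matches allow patterns with re.search, so plain re reproduces the filtering. The sample URLs below are guesses at the site's naming scheme, not taken from the question.

```python
import re

# The three allow patterns from the spider above.
allow_patterns = [
    r"\S+-phones-\d+\.php",          # maker pages
    r"\S+-phones-f-\d+-0-\S+\.php",  # paginated maker pages
    r"\S+_\S+_\S+-\d+\.php",         # model pages -> parse_archive
]

# Hypothetical hrefs in the style the patterns seem to target.
samples = [
    "samsung-phones-9.php",
    "samsung-phones-f-9-0-p2.php",
    "samsung_galaxy_s5-6033.php",
]

for url in samples:
    matched = [p for p in allow_patterns if re.search(p, url)]
    print(url, "->", matched)
```

If a sample href the site actually serves matches none of the patterns, that rule silently extracts nothing, which is a common reason a CrawlSpider "doesn't seem to work".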
I think you'd better use BaseSpider instead of CrawlSpider; this code might help.
EDIT
If you want to keep track of where these phone URLs are coming from, you can pass the url as meta from parse to parse_phone through parse_marker, so each request carries it along.
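A minimal offline sketch of that meta-passing flow: scrapy.Request is stubbed with a namedtuple so it can run without a crawl, and the callback names follow the answer's wording. In real Scrapy code you would yield scrapy.Request(url, callback=..., meta=...) (or response.follow) instead.

```python
from collections import namedtuple

# Stand-in for scrapy.Request, just to show how meta travels.
Request = namedtuple("Request", "url callback meta")


def parse(marker_href, seed_url):
    # parse() queues each manufacturer page, remembering the seed it came from.
    return Request(url=marker_href, callback="parse_marker",
                   meta={"seed_url": seed_url})


def parse_marker(phone_href, request):
    # parse_marker() forwards the meta it received, adding its own url
    # so parse_phone knows which manufacturer page the phone came from.
    meta = dict(request.meta, marker_url=request.url)
    return Request(url=phone_href, callback="parse_phone", meta=meta)


r1 = parse("http://DOMAIN/manufacturer1-type-59.php",
           "http://DOMAIN/manufacturers.php")
r2 = parse_marker("http://DOMAIN/manufacturer1_model1_type1.php", r1)
print(r2.meta)  # both the seed url and the manufacturer url ride along
```

In the real spider, parse_phone would then read response.meta["marker_url"] to record the origin next to each extracted title.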