I'm learning Scrapy and I'm trying to extract all links that contain "http://lattes.cnpq.br/" followed by a sequence of numbers, for example: http://lattes.cnpq.br/0281123427918302
But I don't know which page of the web site contains these links.
For example, on this web site:
http://www.ppgcc.ufv.br/
the links that I want are on this page:
http://www.ppgcc.ufv.br/?page_id=697
What could I do? I'm trying to use rules but I don't know how to use regular expressions correctly. Thank you
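Just to make the pattern concrete, this is the kind of match I have in mind (a plain illustrative regex check, not the crawler itself):

import re

# Illustrative only: a Lattes profile link is the domain followed by a run of digits.
lattes_re = re.compile(r"http://lattes\.cnpq\.br/\d+")
print(bool(lattes_re.match("http://lattes.cnpq.br/0281123427918302")))  # True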
EDIT 1 ----
I need to search all pages of the main site (ppgcc.ufv.br) for links of this kind (http://lattes.cnpq.br/asequenceofnumbers). My objective is to get all the lattes.cnpq.br/numbers links, but I don't know where they are. Right now I'm using a simple spider like this:
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["ppgcc.ufv.br"]
    start_urls = (
        'http://www.ppgcc.ufv.br/',
    )
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*']), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'@href']), callback='parse')]

    def parse(self, response):
        filename = str(random.randint(1, 9999))
        open(filename, 'wb').write(response.body)
        # I'm trying to understand how to use rules correctly
EDIT 2 ----
Using:
class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = [".ppgcc.ufv.br"]
    start_urls = (
        'http://www.ppgcc.ufv.br/',
    )
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*']), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'@href']), callback='parse_links')]

    def parse_links(self, response):
        filename = "Lattes.txt"
        arquivo = open(filename, 'wb')
        extractor = LinkExtractor(allow=r'lattes\.cnpq\.br/\d+')
        for link in extractor.extract_links(response):
            url = link.url
            arquivo.writelines("%s\n" % url)
            print url
It shows me:
C:\Python27\Scripts\tutorial3>scrapy crawl example
2015-06-02 08:08:18-0300 [scrapy] INFO: Scrapy 0.24.6 started (bot: tutorial3)
2015-06-02 08:08:18-0300 [scrapy] INFO: Optional features available: ssl, http11
2015-06-02 08:08:18-0300 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial3.spiders', 'SPIDER_MODULES': ['tutorial3
.spiders'], 'BOT_NAME': 'tutorial3'}
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMidd
leware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMidd
leware, ChunkedTransferMiddleware, DownloaderStats
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLe
ngthMiddleware, DepthMiddleware
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled item pipelines:
2015-06-02 08:08:19-0300 [example] INFO: Spider opened
2015-06-02 08:08:19-0300 [example] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-02 08:08:19-0300 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-02 08:08:19-0300 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-02 08:08:19-0300 [example] DEBUG: Crawled (200) <GET http://www.ppgcc.ufv.br/> (referer: None)
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.cgu.gov.br': <GET http://www.cgu.gov.br/acessoainformacao
gov/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.brasil.gov.br': <GET http://www.brasil.gov.br/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.ppgcc.ufv.br': <GET http://www.ppgcc.ufv.br/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.ufv.br': <GET http://www.ufv.br/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.dpi.ufv.br': <GET http://www.dpi.ufv.br/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.portal.ufv.br': <GET http://www.portal.ufv.br/?page_id=84
>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.wordpress.org': <GET http://www.wordpress.org/>
2015-06-02 08:08:19-0300 [example] INFO: Closing spider (finished)
2015-06-02 08:08:19-0300 [example] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 215,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 18296,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 2, 11, 8, 19, 912000),
'log_count/DEBUG': 10,
'log_count/INFO': 7,
'offsite/domains': 7,
'offsite/filtered': 42,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 6, 2, 11, 8, 19, 528000)}
2015-06-02 08:08:19-0300 [example] INFO: Spider closed (finished)
And I was looking at the site's source code: there are links to more pages that the crawler didn't GET, so maybe my rules are incorrect.
So, a couple of things first:

1) the rules attribute only works if you're extending the CrawlSpider class; they won't work if you extend the simpler scrapy.Spider.

2) if you go the rules and CrawlSpider route, you should not override the default parse callback, because the default implementation is what actually calls the rules -- so you want to use another name for your callback.

3) to do the actual extraction of the links you want, you can use a LinkExtractor inside your callback to scrape the links from the page:
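For example, a minimal sketch of such a spider could look like the one below. I'm assuming Scrapy 1.0+ import paths (on 0.24-era installs the same classes live under scrapy.contrib.spiders and scrapy.contrib.linkextractors) and that yielding plain dicts as items is supported; the spider name and the output field are just illustrative:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LattesSpider(CrawlSpider):
    name = "lattes"                      # illustrative name
    allowed_domains = ["ppgcc.ufv.br"]   # domain only, without a leading dot
    start_urls = ["http://www.ppgcc.ufv.br/"]

    # Follow every internal link; the callback runs on each crawled page.
    # Note the callback is not named parse -- CrawlSpider uses parse internally.
    rules = [
        Rule(LinkExtractor(allow=r".*"), callback="parse_links", follow=True),
    ]

    def parse_links(self, response):
        # Extract only the Lattes profile URLs from the current page body.
        extractor = LinkExtractor(allow=r"lattes\.cnpq\.br/\d+")
        for link in extractor.extract_links(response):
            yield {"lattes_url": link.url}   # field name is illustrative

You could then run something like scrapy crawl lattes -o lattes.csv to dump every collected URL to a file.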
I hope it helps.