Scrapy crawl and follow links within href

Posted 2019-06-08 09:25

I am very new to Scrapy. I need to follow hrefs from the homepage of a site to multiple depths: the pages those links lead to contain further hrefs, and I need to keep following them until I reach the page I want to scrape. The sample HTML of my pages is:

Initial Page

<div class="page-categories">
 <a class="menu"  href="/abc.html">
 <a class="menu"  href="/def.html">
</div>

Inside abc.html

<div class="cell category" >
 <div class="cell-text category">
 <p class="t">
  <a id="cat-24887" href="fgh.html"/>
</p>
</div>

I need to scrape the contents of this fgh.html page. Could anyone suggest where to start? I read about LinkExtractors but could not find a suitable reference to begin with. Thank you.

Tags: python web-scraping scrapy scrapy-spider

1 answer

Ridiculous、

#2 · 2019-06-08 10:04

From what I see, I can say that:

URLs to product categories always end with .kat
URLs to products contain id_ followed by a set of digits
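
To make the two patterns concrete, here is a small sketch that classifies URLs with the same regular expressions. As far as I know, LinkExtractor applies its allow patterns as a regex search against the absolute URL, so re.search is the closest stand-in. The sample URLs and the classify helper are made up for illustration; they are not taken from the real site.

```python
import re

# The same patterns that are passed to LinkExtractor(allow=...).
CATEGORY = re.compile(r"\.kat$")    # URL ends with .kat
PRODUCT = re.compile(r"/id_\d+/")   # URL contains /id_<digits>/


def classify(url):
    """Return 'category', 'product', or 'other' for a URL (hypothetical helper)."""
    if CATEGORY.search(url):
        return "category"
    if PRODUCT.search(url):
        return "product"
    return "other"


# Made-up URLs, only to exercise the patterns.
samples = [
    "http://www.codecheck.info/food.kat",
    "http://www.codecheck.info/food/id_12345/item.pro",
    "http://www.codecheck.info/about.html",
]

for url in samples:
    print(classify(url), url)
```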

Let's use this information to define our spider rules:

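# scrapy.contrib was removed in later Scrapy releases; these are the current import paths.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CodeCheckspider(CrawlSpider):
    name = "code_check"

    allowed_domains = ["www.codecheck.info"]
    start_urls = ['http://www.codecheck.info/']

    rules = [
        # Category pages: keep following their links, no callback needed.
        Rule(LinkExtractor(allow=r'\.kat$'), follow=True),
        # Product pages: hand the response to parse_product.
        Rule(LinkExtractor(allow=r'/id_\d+/'), callback='parse_product'),
    ]

    def parse_product(self, response):
        # .get() returns the first match, or None if the page has no <title>.
        title = response.xpath('//title/text()').get()
        print(title)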

In other words, we ask the spider to follow every category link and to call parse_product whenever it crawls a link containing id_, which for us means we have found a product. Here, for the sake of example, I just print the page title to the console. This should give you a good starting point.
