Constructing a regular expression for url in start

I am very new to scrapy and also i didn't used regular expressions before

The following is my spider.py code

class ExampleSpider(BaseSpider):
   name = "test_code
   allowed_domains = ["www.example.com"]
   start_urls = [
       "http://www.example.com/bookstore/new/1?filter=bookstore",
       "http://www.example.com/bookstore/new/2?filter=bookstore",
       "http://www.example.com/bookstore/new/3?filter=bookstore",
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)

Now if we look at start_urls all the three urls are same except they differ at integer value 2?, 3? and so on i mean unlimited according to urls present on the site , i now that we can use crawlspider and we can construct regular expression for the URL like below,

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    import re

    class ExampleSpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
       "http://www.example.com/bookstore/new/1?filter=bookstore",
       "http://www.example.com/bookstore/new/2?filter=bookstore",
       "http://www.example.com/bookstore/new/3?filter=bookstore",
   ]

        rules = (
            Rule(SgmlLinkExtractor(allow=(........),))),
        ) 

   def parse(self, response):
       hxs = HtmlXPathSelector(response)

can u please guide me , that how can i construct a crawl spider Rule for the above start_url list.

标签： python scrapy

2条回答

Ridiculous、

2楼-- · 2019-03-01 07:53

If i understand you correctly, you want a lot of start URL with a certain pattern.

If so, you can override BaseSpider.start_requests method:

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]

    def start_requests(self):
        for i in xrange(1000):
            yield self.make_requests_from_url("http://www.example.com/bookstore/new/%d?filter=bookstore" % i)

    ...

0人赞添加讨论(0) 举报

ら.Afraid

3楼-- · 2019-03-01 08:11

If you are using CrawlSpider, it's not usually a good idea to override the parse method.

Rule object can filter the urls you are interesed to the ones you do not care for.

See CrawlSpider in the docs for reference.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import re

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/bookstore']

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/new\/[0-9]\?',)), callback='parse_bookstore'),
    )

def parse_boostore(self, response):
   hxs = HtmlXPathSelector(response)

0人赞添加讨论(0) 举报

Constructing a regular expression for url in start

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间