Scrapy: ValueError('Missing scheme in request url')

Posted 2019-06-02 11:35

I am trying to scrape data from a webpage. The webpage is simply a bullet list of 2500 URLs. My Scrapy spider goes to each and every URL and fetches some data ...

Here is my code

from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider

from myproject.items import NewsFields  # adjust to wherever your NewsFields item is defined


class MySpider(CrawlSpider):
    name = 'dknews'
    start_urls = ['http://www.example.org/uat-area/scrapy/all-news-listing']
    allowed_domains = ['example.org']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        ptype = soup.find_all(attrs={"name": "dkpagetype"})
        ptitle = soup.find_all(attrs={"name": "dkpagetitle"})
        pturl = soup.find_all(attrs={"name": "dkpageurl"})
        ptdate = soup.find_all(attrs={"name": "dkpagedate"})
        ptdesc = soup.find_all(attrs={"name": "dkpagedescription"})
        for node in soup.find_all("div", class_="module_content-panel-sidebar-content"):
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
        yield nf
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)

Now the problem is that the above code scrapes only around 215 out of 2500 articles, then closes with this error ...

ValueError('Missing scheme in request url: %s' % self._url)

I have no idea what is causing this error ....

Any help is much appreciated.

Thanks

Tags: python scrapy
1 Answer
乱世女痞
Answered 2019-06-02 11:58

Update 01/2019

Nowadays Scrapy's Response instances have a pretty convenient method, response.follow, which generates a Request from a given URL (either absolute or relative, or even a Link object produced by LinkExtractor), using response.url as the base:

yield response.follow('some/url', callback=self.parse_some_url, headers=headers, ...)

Docs: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.follow
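
Applied to this question's spider, the whole link-following loop collapses into one call; a minimal sketch, assuming the same XPath as in the question:

    def parse(self, response):
        # ... field extraction as before ...
        for href in response.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            # response.follow resolves relative hrefs against response.url,
            # so both absolute and relative links produce valid Requests
            yield response.follow(href, callback=self.parse)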


The code below looks like the source of the issue:

for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
    yield Request(url, callback=self.parse)

If any of the URLs is not fully qualified, e.g. looks like href="/path/to/page" rather than href="http://example.com/path/to/page", you'll get that error. To ensure you yield correct requests, you can use urljoin:

    yield Request(response.urljoin(url), callback=self.parse)
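
For illustration, response.urljoin simply delegates to the standard library's urljoin with response.url as the base, so relative hrefs gain a scheme and host while absolute ones pass through untouched (the paths below are made up for illustration):

    from urllib.parse import urljoin

    base = 'http://www.example.org/uat-area/scrapy/all-news-listing'
    print(urljoin(base, '/news/article-1'))
    # http://www.example.org/news/article-1
    print(urljoin(base, 'http://www.example.org/other'))
    # http://www.example.org/other  (absolute URLs pass through unchanged)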

The Scrapy way, though, is to use LinkExtractor: https://doc.scrapy.org/en/latest/topics/link-extractors.html
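
A sketch of that approach, assuming the listing links live under //ul[@class="scrapy"] as in the question. Note that with CrawlSpider rules the callback must not be named parse, since CrawlSpider uses parse internally:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class MySpider(CrawlSpider):
        name = 'dknews'
        allowed_domains = ['example.org']
        start_urls = ['http://www.example.org/uat-area/scrapy/all-news-listing']

        rules = (
            # LinkExtractor always yields absolute URLs,
            # so "Missing scheme" cannot happen here
            Rule(LinkExtractor(restrict_xpaths='//ul[@class="scrapy"]'),
                 callback='parse_news'),
        )

        def parse_news(self, response):
            # extract the meta fields / body text here, as in the question
            ...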
