Scrapy LinkExtractor duplicating(?)

Published 2019-01-12 07:27

Question:

I have the crawler implemented as below.

It is working, and it crawls the pages allowed by the link extractor.

Basically, what I am trying to do is extract information from different places on the page:

- the href and text() under the class 'news' (if it exists)

- the image URL under the class 'think block' (if it exists)

I have three problems with my Scrapy spider:

1) Duplicating LinkExtractor

It seems that it duplicates already-processed pages. (I checked the export file and found that the same ~.img appeared many times, which should hardly be possible.)

The fact is, every page on the website has hyperlinks at the bottom that direct users to the topics they are interested in, while my objective is to extract information from each topic's page (which lists the titles of several articles under the same topic) and the images found within each article's page (you reach an article's page by clicking on its title on the topic page).

I suspect the link extractor loops over the same pages again and again in this case.

(Maybe this can be solved with DEPTH_LIMIT?)

2) Improving parse_item

I think parse_item is quite inefficient. How could I improve it? I need to extract information from different places on the page (and, of course, only extract it if it exists). Besides, it looks like parse_item only produces HKejImage items but not HKejItem items (again, I checked the output file). How should I tackle this?

3) I need the spider to be able to read Chinese.

I am crawling a Hong Kong site, so it is essential that the spider can handle Chinese text.

The site:

http://www1.hkej.com/dailynews/headline/article/1105148/IMF%E5%82%B3%E4%BF%83%E4%B8%AD%E5%9C%8B%E9%80%80%E5%87%BA%E6%95%91%E5%B8%82

As long as a page belongs to 'dailynews', it is something I want.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
import items


class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    allowed_domains = ["hkej.com"]
    login_page = 'http://www.hkej.com/template/registration/jsp/login.jsp'
    start_urls =  'http://www.hkej.com/dailynews'

    rules=(Rule(LinkExtractor(allow=('dailynews', ),unique=True), callback='parse_item', follow =True),
           )


    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # name column
    def login(self, response):
        return FormRequest.from_response(response,
                    formdata={'name': 'users', 'password': 'my password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "username" in response.body:       
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            return Request(url=self.start_urls)
        else:
            self.log("\n\n\nYou are not logged in.\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens

    def parse_item(self, response):
        hxs = Selector(response)
        news=hxs.xpath("//div[@class='news']")
        images=hxs.xpath('//p')

        for image in images:
            allimages=items.HKejImage()
            allimages['image'] = image.xpath('a/img[not(@data-original)]/@src').extract()
            yield allimages

        for new in news:
            allnews = items.HKejItem()
            allnews['news_title']=new.xpath('h2/@text()').extract()
            allnews['news_url'] = new.xpath('h2/@href').extract()
            yield allnews

Thank you very much and I would appreciate any help!

Answer 1:

First, to set settings, do it in the settings.py file, or you can specify the custom_settings attribute on the spider, like:

custom_settings = {
    'DEPTH_LIMIT': 3,
}
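
If you go the per-spider route, custom_settings is a class attribute, so it sits next to name and rules in the spider. A minimal sketch, reusing the spider from the question:

class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    allowed_domains = ["hkej.com"]

    # per-spider settings override settings.py for this spider only
    custom_settings = {
        'DEPTH_LIMIT': 3,
    }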

Then, you have to make sure the spider actually reaches the parse_item method (which I don't think it does, though I haven't tested it). Also, you shouldn't specify both the callback and follow parameters on the same rule, because they don't work well together.

First, remove follow from your rule, or add another rule, so you can control which links get followed and which links get returned as items.
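
For example, one way to split it (a rough sketch; the allow/deny patterns are assumptions based on the article URL above and would need adjusting to the site's real URL structure) is one rule that only follows listing pages and a second rule that hands article pages to parse_item:

rules = (
    # follow 'dailynews' listing/topic pages, but don't parse them as items
    Rule(LinkExtractor(allow=('dailynews', ), deny=('article', ), unique=True),
         follow=True),
    # article pages go to parse_item and are not followed further
    Rule(LinkExtractor(allow=('dailynews/.*/article', ), unique=True),
         callback='parse_item'),
)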

Second, in your parse_item method, your XPaths are incorrect. To get all the images, maybe you could use something like:

images=hxs.xpath('//img')

and then, to get the image URL:

allimages['image'] = image.xpath('./@src').extract()

For the news, it looks like this could work:

allnews['news_title']=new.xpath('.//a/text()').extract()
allnews['news_url'] = new.xpath('.//a/@href').extract()
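
Putting those corrected XPaths together, a minimal parse_item sketch (reusing the HKejImage and HKejItem item classes from the question, and only yielding a news item when something was actually matched) could look like:

def parse_item(self, response):
    hxs = Selector(response)

    # image URLs: take the src attribute of every <img> on the page
    for src in hxs.xpath('//img/@src').extract():
        image_item = items.HKejImage()
        image_item['image'] = src
        yield image_item

    # news titles and links inside each div.news block, if any exist
    for news in hxs.xpath("//div[@class='news']"):
        title = news.xpath('.//a/text()').extract()
        url = news.xpath('.//a/@href').extract()
        if title or url:
            news_item = items.HKejItem()
            news_item['news_title'] = title
            news_item['news_url'] = url
            yield news_item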

Now, as I understand your problem, this isn't a LinkExtractor duplication error, but just poorly specified rules. Also, make sure your XPaths are valid; your question didn't suggest that you knew they needed correcting.