Scrapy spider not returning any results

Posted 2019-07-11 07:10

Question:

This is my first attempt at writing a spider, so please bear with me if I have done something wrong. I am trying to extract data from http://www.4icu.org/in/ and want the entire list of colleges displayed on that page. When I run the following spider, however, I get an empty JSON file.

My items.py:

    import scrapy

    class CollegesItem(scrapy.Item):
        # define the fields for your item here like:
        link = scrapy.Field()

This is the spider, colleges.py:

    import scrapy
    from scrapy.spider import Spider
    from scrapy.http import Request

    class CollegesItem(scrapy.Item):
    # define the fields for your item here like:
        link = scrapy.Field()

    class CollegesSpider(Spider):
        name = 'colleges'
        allowed_domains = ["4icu.org"]
        start_urls = ('http://www.4icu.org/in/',)

        def parse(self, response):
            return Request(
                url = "http://www.4icu.org/in/",
                callback = self.parse_fixtures
            )
        def parse_fixtures(self,response):
            sel = response.selector
            for div in sel.css("col span_2_of_2>div>tbody>tr"):
                item = Fixture()
                item['university.name'] = tr.xpath('td[@class="i"]/span  /a/text()').extract()
                yield item

Answer 1:

As stated in the comments on the question, there are some issues with your code.

First of all, you do not need two methods, because the parse method requests the same URL that is already listed in start_urls.

To get some information from the site try using the following code:

    def parse(self, response):
        for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
            # Keep only rows that contain a td with class "i"
            if tr.xpath(".//td[@class='i']"):
                name = tr.xpath('./td[1]/a/text()').extract()[0]
                location = tr.xpath('./td[2]//text()').extract()[0]
                print(name, location)

and adjust it to your needs to fill your item (or items).

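Note that your browser displays an additional tbody in the table that is not present in the HTML Scrapy receives: browsers insert the element themselves when they build the page. This means you cannot always copy what you see in the browser's inspector; check your selectors against the response Scrapy actually downloads.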
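To illustrate the tbody point, here is a minimal sketch that uses only Python's standard library rather than Scrapy itself. The markup, URLs, and university names are made up for the example, and `xml.etree.ElementTree` is only a stand-in for Scrapy's selectors so the snippet runs without a network request. The same idea should carry over to `response.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the raw HTML that a
# scraper downloads. Note that the table has no <tbody> element.
RAW_HTML = """
<div class="col span_2_of_2">
  <table>
    <tr><td class="i"><a href="/reviews/1.htm">Example University A</a></td><td>City A</td></tr>
    <tr><td class="i"><a href="/reviews/2.htm">Example University B</a></td><td>City B</td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)  # root is the outer <div>

# Path as a browser inspector would show it: it expects a <tbody>
# that does not exist in the source, so nothing matches.
rows_with_tbody = root.findall("./table/tbody/tr")

# Path that follows the markup as served.
rows_without_tbody = root.findall("./table/tr")

# Pull the link text out of each matched row.
names = [tr.find("./td[@class='i']/a").text for tr in rows_without_tbody]

print(len(rows_with_tbody), len(rows_without_tbody))  # prints "0 2"
print(names)
```

A path that names tbody silently matches nothing instead of raising an error, which is one way a spider can end up writing an empty output file.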



Answer 2:

Here is the working code:

    import scrapy
    from scrapy.spiders import Spider


    class CollegesItem(scrapy.Item):
        # Fields collected for each college
        name = scrapy.Field()
        location = scrapy.Field()


    class CollegesSpider(Spider):
        name = 'colleges'
        allowed_domains = ["4icu.org"]
        start_urls = ('http://www.4icu.org/in/',)

        def parse(self, response):
            # Select the rows under the table without naming a tbody
            for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
                # Keep only rows that contain a td with class "i"
                if tr.xpath(".//td[@class='i']"):
                    item = CollegesItem()
                    item['name'] = tr.xpath('./td[1]/a/text()').extract()[0]
                    item['location'] = tr.xpath('./td[2]//text()').extract()[0]
                    yield item

After running the spider with this command:

    >>scrapy crawl colleges -o mait.json

The following is a snippet of the results:

    [[[[[[[{"name": "Indian Institute of Technology Bombay", "location": "Mumbai"},
    {"name": "Indian Institute of Technology Madras", "location": "Chennai"},
    {"name": "University of Delhi", "location": "Delhi"},
    {"name": "Indian Institute of Technology Kanpur", "location": "Kanpur"},
    {"name": "Anna University", "location": "Chennai"},
    {"name": "Indian Institute of Technology Delhi", "location": "New Delhi"},
    {"name": "Manipal University", "location": "Manipal ..."},
    {"name": "Indian Institute of Technology Kharagpur", "location": "Kharagpur"},
    {"name": "Indian Institute of Science", "location": "Bangalore"},
    {"name": "Panjab University", "location": "Chandigarh"},
    {"name": "National Institute of Technology, Tiruchirappalli", "location": "Tiruchirappalli"}, .........