This is my first attempt at writing a spider, so kindly bear with me if I have not done it properly.
Here is the link to the website I am trying to extract data from: http://www.4icu.org/in/. I want the entire list of colleges displayed on the page, but when I run the following spider I am returned an empty JSON file.
My items.py:
import scrapy

class CollegesItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
This is the spider, colleges.py:
import scrapy
from scrapy.spider import Spider
from scrapy.http import Request

class CollegesItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()

class CollegesSpider(Spider):
    name = 'colleges'
    allowed_domains = ["4icu.org"]
    start_urls = ('http://www.4icu.org/in/',)

    def parse(self, response):
        return Request(
            url="http://www.4icu.org/in/",
            callback=self.parse_fixtures
        )

    def parse_fixtures(self, response):
        sel = response.selector
        for div in sel.css("col span_2_of_2>div>tbody>tr"):
            item = Fixture()
            item['university.name'] = tr.xpath('td[@class="i"]/span /a/text()').extract()
            yield item
As stated in the comments on the question, there are some issues with your code.
First of all, you do not need two methods: in the parse method you request the same URL you already listed in start_urls, so Scrapy hands that response to parse anyway, and the extra Request for the same URL is dropped by the duplicate filter.
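Conceptually, Scrapy already does the dispatching for you: every URL in start_urls is scheduled and its response passed to the default callback, while a dupefilter drops requests for URLs it has already seen. The snippet below is a toy, stdlib-only sketch of that flow (the names and structure are illustrative, not Scrapy's actual internals):

```python
# Toy model of Scrapy's scheduling: start_urls feed the default
# callback, and a dupefilter drops URLs that were already requested.
def crawl(start_urls, parse):
    seen = set()
    results = []
    for url in start_urls:
        if url in seen:              # the dupefilter at work
            continue
        seen.add(url)
        response = {"url": url}      # stand-in for a downloaded response
        results.extend(parse(response))
    return results

def parse(response):
    # Yielding a new Request back to the same URL here would be
    # filtered as a duplicate -- so just extract the data directly.
    yield {"scraped_from": response["url"]}

items = crawl(("http://www.4icu.org/in/",), parse)
print(items)  # [{'scraped_from': 'http://www.4icu.org/in/'}]
```

In other words, extracting data straight from parse is both shorter and what the framework expects.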
To get some information from the site try using the following code:
def parse(self, response):
    for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
        if tr.xpath(".//td[@class='i']"):
            name = tr.xpath('./td[1]/a/text()').extract()[0]
            location = tr.xpath('./td[2]//text()').extract()[0]
            print name, location
and adjust it to your needs to fill your item (or items).
As you can see, your browser displays an additional tbody element inside the table which is not present in the HTML that Scrapy downloads. Browsers insert tbody automatically when rendering tables, so you cannot always trust the markup you see in the browser's inspector when writing your selectors.
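You can verify this without Scrapy: tokenize a raw table snippet with the standard library's html.parser and note that no tbody tag appears, even though a browser inspector would show one. The HTML string below is a made-up miniature of the page's structure, not the site's real markup:

```python
from html.parser import HTMLParser

# Raw HTML as a server would send it: note there is no <tbody>.
raw = '<table><tr><td class="i"><a href="#">IIT Bombay</a></td></tr></table>'

class TagCollector(HTMLParser):
    """Records every start tag seen in the raw markup."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(raw)
print(collector.tags)  # ['table', 'tr', 'td', 'a'] -- no 'tbody'
```

This is why an XPath copied from the browser's DOM (table/tbody/tr) matches nothing in Scrapy, while table//tr works.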
Here is the working code:
import scrapy
from scrapy.spider import Spider

class CollegesItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    location = scrapy.Field()

class CollegesSpider(Spider):
    name = 'colleges'
    allowed_domains = ["4icu.org"]
    start_urls = ('http://www.4icu.org/in/',)

    def parse(self, response):
        for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
            if tr.xpath(".//td[@class='i']"):
                item = CollegesItem()
                item['name'] = tr.xpath('./td[1]/a/text()').extract()[0]
                item['location'] = tr.xpath('./td[2]//text()').extract()[0]
                yield item
After running the spider with the command
>>scrapy crawl colleges -o mait.json
the following is a snippet of the results:
[[[[[[[{"name": "Indian Institute of Technology Bombay", "location": "Mumbai"},
{"name": "Indian Institute of Technology Madras", "location": "Chennai"},
{"name": "University of Delhi", "location": "Delhi"},
{"name": "Indian Institute of Technology Kanpur", "location": "Kanpur"},
{"name": "Anna University", "location": "Chennai"},
{"name": "Indian Institute of Technology Delhi", "location": "New Delhi"},
{"name": "Manipal University", "location": "Manipal ..."},
{"name": "Indian Institute of Technology Kharagpur", "location": "Kharagpur"},
{"name": "Indian Institute of Science", "location": "Bangalore"},
{"name": "Panjab University", "location": "Chandigarh"},
{"name": "National Institute of Technology, Tiruchirappalli", "location": "Tiruchirappalli"}, .........
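The exported file is simply a JSON array of item dicts, so it is easy to sanity-check. The snippet below builds an equivalent list in memory and round-trips it through the json module (rather than reading mait.json, so it runs anywhere):

```python
import json

# A few items as Scrapy's JSON exporter would write them.
items = [
    {"name": "Indian Institute of Technology Bombay", "location": "Mumbai"},
    {"name": "University of Delhi", "location": "Delhi"},
]

# -o writes the scraped items as one JSON array; dumping and loading
# mimics writing mait.json and reading it back in.
exported = json.dumps(items)
loaded = json.loads(exported)
print(len(loaded), loaded[0]["name"])
```

One caveat: with older Scrapy versions, -o appends to an existing file, so repeated runs can leave several stray opening brackets (as in the snippet above) and make the file invalid JSON. Delete the output file between runs, or use a fresh filename.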