I am trying to run my base spider against a page with dynamic pagination, but the crawl is not succeeding. I am using Selenium to handle the AJAX-based pagination. The URL I am using is: http://www.demo.com. Here is my code:
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import demoItem
from selenium import webdriver
def removeUnicodes(strData):
    if(strData):
        #strData = strData.decode('unicode_escape').encode('ascii','ignore')
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
        #print 'Output:',strData
    return strData
class demoSpider(scrapy.Spider):
    name = "demourls"
    allowed_domains = ["demo.com"]
    start_urls = ['http://www.demo.com']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        print "*****************************************************"
        self.driver.get(response.url)
        print response.url
        print "______________________________"
        hxs = Selector(response)
        item = demoItem()
        finalurls = []
        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
            try:
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
                print '**********************************************2***url*****************************************', urls
                for url in urls:
                    print '---------url-------', url
                    finalurls.append(url)
                item['urls'] = finalurls
            except:
                break
        self.driver.close()
        return item
My items.py is:
from scrapy.item import Item, Field

class demoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()
When I crawl it and export the result to JSON, my JSON file looks like this:
[{"pageurl": "http://www.demo.com", "urls": [], "title": "demo"}]
I am not able to crawl all the URLs because they are loaded dynamically.
First of all, you don't need to click the showMoreCars button, as it is pressed automatically after the page loads. Instead, waiting a few seconds is enough.
Apart from your scrapy code, Selenium is able to capture all the hrefs for you. Here is an example of what you need to do in Selenium (sketched below); all you need to do is merge this with your scrapy part.
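A minimal sketch of that Selenium part, reusing the WebDriver and XPaths from the question's code; the five-second wait and the selectors are assumptions, so adjust them to the real page:

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.demo.com")

# no click needed: the "show more cars" content loads by itself,
# so simply wait for the AJAX requests to finish
time.sleep(5)

# once the dynamic content is in the DOM, Selenium can hand back every href
links = driver.find_elements_by_xpath('//a[@id="linkToDetails"]')
hrefs = [link.get_attribute("href") for link in links]
print hrefs

driver.quit()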
I hope the below code will help.
somespider.py
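The original somespider.py listing is not preserved here; below is a hedged reconstruction of what such a spider could look like, combining the question's scrapy spider with Selenium RC and the HTMLUNITWITHJS browser. The host, port, wait time and XPaths are assumptions:

# -*- coding: utf-8 -*-
import time
import scrapy
from scrapy.selector import Selector
from selenium import selenium          # Selenium RC client, not webdriver
from demo.items import demoItem


class demoSpider(scrapy.Spider):
    name = "demourls"
    allowed_domains = ["demo.com"]
    start_urls = ['http://www.demo.com']

    def __init__(self):
        # connect to the Selenium RC server started separately;
        # HTMLUNITWITHJS is a headless browser with JavaScript support
        self.selenium = selenium("localhost", 4444, "*htmlunitwithjs", "http://www.demo.com")
        self.selenium.start()

    def parse(self, response):
        sel = self.selenium
        sel.open(response.url)
        sel.wait_for_page_to_load("30000")
        # give the AJAX pagination a few extra seconds to load all the cars
        time.sleep(5)

        # parse the HTML that Selenium rendered, not the raw scrapy response
        hxs = Selector(text=sel.get_html_source())

        item = demoItem()
        item['pageurl'] = response.url
        item['title'] = hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0]
        item['urls'] = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
        return item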
items.py
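items.py is presumably the same as the one in the question; repeated here for completeness:

from scrapy.item import Item, Field

class demoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()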
Note: you need to have the Selenium RC server running, because HTMLUNITWITHJS works only with Selenium RC when using Python.
Run your Selenium RC server by issuing the command:
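(assuming the standalone server jar; substitute the version of the jar you actually downloaded)

java -jar selenium-server-standalone-<version>.jar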
Run your spider using the command:
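For example, to get the JSON file shown above (the output filename is up to you):

scrapy crawl demourls -o output.json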