I'm trying to download data from a gsmarena page: "http://www.gsmarena.com/htc_one_me-7275.php".
However, the data is organized into tables and table rows, in the following format:
table header > td[@class='ttl'] > td[@class='nfo']
Edited code: Thanks to the help of community members at Stack Exchange, I've reformatted the code as follows.
Items.py file:
import scrapy


class gsmArenaDataItem(scrapy.Item):
    phoneName = scrapy.Field()
    phoneDetails = scrapy.Field()
Spider file:
from scrapy import Spider

from gsmarena_data.items import gsmArenaDataItem


class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # Each table inside div#specs-list holds one block of specifications.
        for table in response.css("div#specs-list table"):
            # Create a fresh item per table so yielded items do not share state.
            phone = gsmArenaDataItem()
            phone['phoneName'] = table.xpath(".//th/text()").extract()[0]
            details = []
            for ttl in table.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                details.append(ttl_value + ": " + nfo_value)
            # Join all rows instead of overwriting phoneDetails on every iteration.
            phone['phoneDetails'] = ", ".join(details)
            yield phone
However, I'm getting banned as soon as I try to even load the page in scrapy shell using:
"http://www.gsmarena.com/htc_one_me-7275.php"
I've even tried setting DOWNLOAD_DELAY = 3 in settings.py.
Kindly suggest how I should go about it.
I also faced the same problem of getting banned within a few requests. Rotating proxies with scrapy-proxies and enabling AutoThrottle helped significantly, but did not solve the problem completely.
You can find my code at gsmarenacrawler
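The throttling and proxy rotation mentioned above might look roughly like the following in settings.py. This is a minimal sketch and not the code from that repository: the delay values are illustrative, the proxy list path is hypothetical, and the scrapy-proxies setting names and middleware priorities are taken from that project's README as I understand it, so check them against the version you install.

```python
# settings.py (sketch; values are illustrative, not tuned for gsmarena.com)

# Fixed base delay between requests, as tried in the question.
DOWNLOAD_DELAY = 3

# AutoThrottle adjusts the delay based on observed response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

# Retry failed requests so that another proxy gets a chance.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Middleware order as given in the scrapy-proxies README (assumed current).
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Hypothetical path to a text file with one proxy URL per line.
PROXY_LIST = '/path/to/proxy/list.txt'
# 0 = pick a random proxy for every request (per the scrapy-proxies README).
PROXY_MODE = 0
```

As noted above, settings like these reduce how often the ban occurs but should not be expected to remove it entirely.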
The idea would be to iterate over all table elements inside the "specs-list", get the th element for the block name, and get all the td elements with class="ttl" and their corresponding following td siblings with class="nfo".
Demo from the shell:
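Outside the shell, the same idea can be sketched with only the standard library. This is a minimal sketch, not the shell session: SAMPLE is a made-up, well-formed fragment that imitates the structure described above, and extract_specs is a hypothetical helper name. Because xml.etree.ElementTree does not support the following-sibling axis, each "ttl" cell is paired with the "nfo" cell in the same row; in Scrapy you would use response.css and the following-sibling XPath from the question instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the spec tables; not copied from the real page.
SAMPLE = """
<div id="specs-list">
  <table>
    <tr><th rowspan="2">Network</th><td class="ttl">Technology</td><td class="nfo">GSM / HSPA / LTE</td></tr>
    <tr><td class="ttl">2G bands</td><td class="nfo">GSM 850 / 900</td></tr>
  </table>
  <table>
    <tr><th>Body</th><td class="ttl">Weight</td><td class="nfo">155 g</td></tr>
  </table>
</div>
"""


def extract_specs(markup):
    """Return {block name: {ttl text: nfo text}} for every table in markup."""
    root = ET.fromstring(markup.strip())
    specs = {}
    for table in root.iter("table"):
        header = table.find(".//th")
        if header is None:
            continue  # skip tables without a block name
        block = "".join(header.itertext()).strip()
        details = {}
        for row in table.iter("tr"):
            # Pair the label cell with the value cell of the same row.
            ttl = row.find("td[@class='ttl']")
            nfo = row.find("td[@class='nfo']")
            if ttl is None or nfo is None:
                continue
            details["".join(ttl.itertext()).strip()] = "".join(nfo.itertext()).strip()
        specs[block] = details
    return specs


specs = extract_specs(SAMPLE)
for block, details in specs.items():
    print(block, details)
```

Keeping the result as a nested dictionary, one entry per table, avoids the string concatenation in the question and makes each block's rows easy to export.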