This is my first scraping project with Python/Scrapy. The site is http://pabigtrees.com/, with 78 pages and 20 items (trees) per page. Below is the full spider, with a few changes to make a minimal demonstration (it scrapes only one value per page):
import scrapy
from pabigtrees.items import Tree

class TreesSpider(scrapy.Spider):
    name = "trees"
    start_urls = ["http://pabigtrees.com/view_tree.aspx"]
    allowed_domains = ["pabigtrees.com"]
    download_delay = 2

    def parse(self, response):
        for page in [1, 11, 12]:
        #for page in range(1, 79):
            if page == 1:
                yield scrapy.FormRequest.from_response(
                    response,
                    #callback=self.parse_page
                    callback=self.parse_test
                )
            else:
                yield scrapy.FormRequest.from_response(
                    response,
                    formdata={
                        '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
                        '__EVENTARGUMENT': "Page$" + str(page),
                        'ctl00$ContentPlaceHolder1$genus_latin': '0',
                        'ctl00$ContentPlaceHolder1$genus_common': '0',
                        'ctl00$ContentPlaceHolder1$county': '0',
                        '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                        '__VIEWSTATEGENERATOR': response.css('input#__VIEWSTATEGENERATOR::attr(value)').extract_first(),
                        '__SCROLLPOSITIONX': response.css('input#__SCROLLPOSITIONX::attr(value)').extract_first(),
                        '__SCROLLPOSITIONY': response.css('input#__SCROLLPOSITIONY::attr(value)').extract_first(),
                        '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
                    },
                    #callback=self.parse_page
                    callback=self.parse_test
                )

    def parse_test(self, response):
        yield {
            'county': response.xpath('//a[contains(@href,"Select$1")]/../../../td[5]/font/text()').extract_first()
        }

    def parse_page(self, response):
        for tree in range(0, 20):
            yield scrapy.FormRequest.from_response(
                response,
                formdata={
                    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
                    '__EVENTARGUMENT': "Select$" + str(tree)
                },
                # save the county from the list page because it is not
                # available on the detail page
                meta={'county': response.xpath('//a[contains(@href,"Select$' + str(tree) + '")]/../../../td[5]/font/text()').extract_first()},
                callback=self.parse_results
            )

    def parse_results(self, response):
        item = Tree()
        genus = response.css('span#ctl00_ContentPlaceHolder1_tree_genus::text').extract()
        species = response.css('span#ctl00_ContentPlaceHolder1_tree_species::text').extract()
        circumference = response.css('span#ctl00_ContentPlaceHolder1_lblcircum::text').extract()
        spread = response.css('span#ctl00_ContentPlaceHolder1_lblSpread::text').extract()
        height = response.css('span#ctl00_ContentPlaceHolder1_lblHeight::text').extract()
        points = response.css('span#ctl00_ContentPlaceHolder1_lblPoints::text').extract()
        address = response.css('span#ctl00_ContentPlaceHolder1_lblAddress::text').extract()
        crew = response.xpath('//td[text()="Measuring Crew: "]/following-sibling::td/text()').extract()
        nominator = response.xpath('//td[text()="Original Nominator: "]/following-sibling::td/text()').extract()
        comments = response.xpath('//td[text()="Comments: "]/following-sibling::td/text()').extract()
        gps = response.xpath('//td[text()="GPS Coordinates: "]/following-sibling::td/text()').extract()
        technique = response.css('span#ctl00_ContentPlaceHolder1_lblTech::text').extract()
        yearnominated = response.css('span#ctl00_ContentPlaceHolder1_lblNom::text').extract()
        yearlastmeasured = response.css('span#ctl00_ContentPlaceHolder1_lblMeasured::text').extract()
        item['a_county'] = response.meta['county']
        item['b_genus'] = genus
        item['c_species'] = species
        item['d_circumference'] = circumference
        item['e_spread'] = spread
        item['f_height'] = height
        item['g_points'] = points
        item['h_address'] = address
        item['i_crew'] = crew
        item['j_nominator'] = nominator
        item['k_comments'] = comments
        item['l_gps'] = gps
        item['m_technique'] = technique
        item['n_yearnominated'] = yearnominated
        item['o_yearlastmeasured'] = yearlastmeasured
        return item
The crawler works fine up through page 11. On page 12 and above, I get 500 errors. I believe it has something to do with the pagination, but as far as I can tell I am sending the correct __VIEWSTATE and related hidden fields. Here's the output:
(python3) Al-Green:pabigtrees Tony$ scrapy crawl trees -o trees.csv
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: pabigtrees)
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-04-14 15:31:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pabigtrees', 'FEED_FORMAT': 'csv', 'FEED_URI': 'trees.csv', 'NEWSPIDER_MODULE': 'pabigtrees.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pabigtrees.spiders']}
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-14 15:31:18 [scrapy.core.engine] INFO: Spider opened
2018-04-14 15:31:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-14 15:31:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-14 15:31:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://pabigtrees.com/robots.txt> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://pabigtrees.com/view_tree.aspx> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:30 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Dauphin'}
2018-04-14 15:31:33 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:33 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Delaware'}
2018-04-14 15:31:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 1 times): 500 Internal Server Error
2018-04-14 15:31:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 2 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 3 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.core.engine] DEBUG: Crawled (500) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:39 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://pabigtrees.com/view_tree.aspx>: HTTP status code is not handled or not allowed
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-14 15:31:39 [scrapy.extensions.feedexport] INFO: Stored csv feed (2 items) in: trees.csv
2018-04-14 15:31:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 134895,
'downloader/request_count': 7,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 5,
'downloader/response_bytes': 98019,
'downloader/response_count': 7,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/404': 1,
'downloader/response_status_count/500': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 14, 19, 31, 39, 475017),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/500': 1,
'item_scraped_count': 2,
'log_count/DEBUG': 11,
'log_count/INFO': 9,
'memusage/max': 50180096,
'memusage/startup': 50176000,
'request_depth_max': 1,
'response_received_count': 5,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/500 Internal Server Error': 2,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'start_time': datetime.datetime(2018, 4, 14, 19, 31, 18, 563326)}
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Spider closed (finished)
I’m stumped, thanks!
The __VIEWSTATE is indeed what is causing you trouble. If you take a look at the pagination of the site you're trying to scrape, you'll see it only links to 10 other pages. Those are the only 10 links of this search you're allowed to access from the current page (with the current view state). The next 10 will be accessible from page 11 of the search.
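You can confirm this from within the spider. The GridView's pager links are javascript:__doPostBack(...) hrefs (the same place your own xpath finds the Select$N arguments), so a quick diagnostic like the following, dropped into any parse callback, lists which Page$ arguments a given response actually exposes:

    # which pages does this response's pager actually link to?
    pages = response.xpath('//a[contains(@href, "Page$")]/@href').re(r'Page\$\d+')
    self.logger.info("reachable pages: %s", pages)

On the first response that list will stop at Page$10 or so, which is why the request for page 12 (built from page 1's view state) gets rejected with a 500.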
One possible solution would be to check in parse_page() whether you're on page 11 (or 21, or 31, ...) and, if so, create the requests for the next 10 pages. Also, you only need to populate the formdata fields you want to change; FormRequest.from_response() will take care of the ones available in hidden input fields, such as __VIEWSTATE or __EVENTVALIDATION.
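For what it's worth, here is a minimal sketch of that chaining idea. The spider name and structure are mine and it is untested against the live site; the field names come from your question. Instead of batching ten requests per window, it simply requests each page from the response of the previous one, so every POST carries a view state that is valid for its target:

    import scrapy

    class ChainedTreesSpider(scrapy.Spider):
        # a minimal sketch, assuming the same form fields as the question
        name = "trees_chained"
        allowed_domains = ["pabigtrees.com"]
        start_urls = ["http://pabigtrees.com/view_tree.aspx"]
        download_delay = 2
        total_pages = 78  # per the question

        def parse(self, response):
            page = response.meta.get('page', 1)
            # ... scrape the 20 rows of the current page here ...

            next_page = page + 1
            if next_page <= self.total_pages:
                # from_response() copies __VIEWSTATE, __EVENTVALIDATION and
                # the other hidden inputs from THIS response, so only the
                # fields that change between pages appear in formdata
                yield scrapy.FormRequest.from_response(
                    response,
                    formdata={
                        '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
                        '__EVENTARGUMENT': "Page$" + str(next_page),
                    },
                    meta={'page': next_page},
                    callback=self.parse,
                )

The batch-of-10 variant is the same pattern, just yielding requests for pages 12-21 from page 11's response, pages 22-31 from page 21's, and so on.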